A Similarity-preserving Neural Network Trained on Transformed Images Recapitulates Salient Features of the Fly Motion Detection Circuit
Yanis Bahroun†  Anirvan M. Sengupta†‡  Dmitri B. Chklovskii†∗
†Flatiron Institute  ‡Rutgers University  ∗NYU Langone Medical Center
{ybahroun,dchklovskii}@flatironinstitute.org, [email protected]
Abstract
Learning to detect content-independent transformations from data is one of the central problems in biological and artificial intelligence. An example of such a problem is unsupervised learning of a visual motion detector from pairs of consecutive video frames. Rao and Ruderman formulated this problem in terms of learning infinitesimal transformation operators (Lie group generators) via minimizing image reconstruction error. Unfortunately, it is difficult to map their model onto a biologically plausible neural network (NN) with local learning rules. Here we propose a biologically plausible model of motion detection. We also adopt the transformation-operator approach but, instead of reconstruction-error minimization, start with a similarity-preserving objective function. An online algorithm that optimizes such an objective function naturally maps onto an NN with biologically plausible learning rules. The trained NN recapitulates major features of the well-studied motion detector in the fly. In particular, it is consistent with the experimental observation that local motion detectors combine information from at least three adjacent pixels, something that contradicts the celebrated Hassenstein-Reichardt model.
Humans can recognize objects, such as human faces, even when presented at various distances, from various angles, and under various illumination conditions. Whereas the brain performs such a task almost effortlessly, this is a challenging unsupervised learning problem. Because the number of training views for any given face is limited, such transformations must be learned from data comprising different faces, i.e., in a content-independent manner. Therefore, learning content-independent transformations plays a central role in reverse engineering the brain and building artificial intelligence.

Perhaps the simplest example of this task is learning a visual motion detector, which computes the optic flow from pairs of consecutive video frames regardless of their content. Motion detector learning was addressed by Rao and Ruderman [32], who formulated this problem as learning infinitesimal translation operators (or generators of the translation Lie group). They learned a motion detector by minimizing, for each pair of consecutive video frames, the squared mismatch between the observed variation in pixel intensity values and that predicted by the scaled infinitesimal translation operator. Whereas such an approach learns the operators and evaluates transformation magnitudes correctly [32, 23, 44], its biological implementation has been lacking (see below).

The non-biological nature of the neural networks (NNs) derived from the reconstruction approach has been previously encountered in the context of discovery of latent degrees of freedom, e.g., dimensionality reduction and sparse coding [8, 27]. When such NNs are derived from the reconstruction-error-minimization objective, they require non-local learning rules, which are not biologically plausible. To overcome this, [29, 30, 31] proposed deriving NNs from objectives that strive to preserve similarity between pairs of inputs in corresponding outputs.

Inspired by [30, 31], we propose a similarity-preserving objective for learning infinitesimal translation operators. Instead of preserving similarity of input pairs, as was done for dimensionality reduction NNs, our objective function preserves the similarity of input features formed by the outer product of the variation in pixel intensity and the pixel intensity, which are suggested by the translation-operator formalism. Such an objective is optimized by an online algorithm that maps onto a biologically plausible NN. After training the similarity-preserving NN on one-dimensional (1D) and two-dimensional (2D) translations, we obtain an NN that recapitulates salient features of the fly motion detection circuit. Thus, our main contribution is the derivation of a biologically plausible NN for learning content-independent transformations by similarity preservation of outer product input features.

We start by reviewing the NNs for discovery of latent degrees of freedom from principled objective functions. Although these NNs do not detect transformations, they provide a useful analogy that will be important for understanding our approach. First, we explain why the NNs derived from minimizing the reconstruction error lack biological plausibility. Then, we show how the NNs derived from similarity preservation objectives solve this problem.

To introduce our notation, the input to the NN is a set of vectors, $x_t \in \mathbb{R}^n$, $t = 1, \ldots, T$, with components represented by the activity of $n$ upstream neurons at time $t$. In response, the NN outputs an activity vector, $y_t \in \mathbb{R}^m$, $t = 1, \ldots, T$, where $m$ is the number of output neurons.

The reconstruction approach starts with minimizing the squared reconstruction error:
$$\min_{W,\, y_{t=1\ldots T} \in \mathbb{R}^m} \sum_{t=1}^{T} \| x_t - W y_t \|^2 = \min_{W,\, y_{t=1\ldots T} \in \mathbb{R}^m} \sum_{t=1}^{T} \left[ \|x_t\|^2 - 2\, x_t^\top W y_t + y_t^\top W^\top W y_t \right], \qquad (1)$$
possibly subject to additional constraints on the latent variables $y_t$ or on the weights $W \in \mathbb{R}^{n \times m}$. Without additional constraints, this objective is optimized offline by a projection onto the principal subspace of the input data, of which PCA is a special case [25].

In an online setting, the objective can be optimized by alternating minimization [27]. After the arrival of a data sample, $x_t$: first, the objective (1) is minimized with respect to the output, $y_t$, while the weights, $W$, are kept fixed; second, the weights are updated according to the following learning rule, derived by gradient descent with respect to $W$ for fixed $y_t$:
$$\dot{y}_t = W_{t-1}^\top x_t - W_{t-1}^\top W_{t-1}\, y_t, \qquad W_t \leftarrow W_{t-1} + \eta\, (x_t - W_{t-1} y_t)\, y_t^\top. \qquad (2)$$
In the NN implementations of the algorithm (2), the elements of the matrix $W$ are represented by synaptic weights and the principal components by the activities of the output neurons $y_j$, Fig. 1a [24].

However, implementing the update (2), right, in the single-layer NN architecture, Fig. 1a, requires non-local learning rules, making it biologically implausible. Indeed, the last term in (2), right, implies that updating the weight of a synapse requires the knowledge of the output activities of all other neurons, which are not available to the synapse.
Moreover, the matrix of lateral connection weights, $-W_{t-1}^\top W_{t-1}$, in the last term of (2), left, is computed as the Gramian of the feedforward weights, a non-local operation. This problem is not limited to PCA and arises in nonlinear NNs as well [27, 19].

Whereas NNs with local learning rules have been proposed [27], their two-layer feedback architecture is not consistent with most biological sensory systems, with the exception of olfaction [17]. Most importantly, such a feedback architecture seems inappropriate for motion detection, which requires speedy processing of streamed stimuli.

To address these difficulties, [30] derived NNs from similarity-preserving objectives. Such objectives require that similar input pairs, $x_t$ and $x_{t'}$, evoke similar output pairs, $y_t$ and $y_{t'}$. If the similarity of a pair of vectors is quantified by their scalar product, one such objective is similarity matching (SM):
$$\min_{\forall t \in \{1,\ldots,T\}:\ y_t \in \mathbb{R}^m} \sum_{t,t'=1}^{T} \left( x_t \cdot x_{t'} - y_t \cdot y_{t'} \right)^2. \qquad (3)$$
This offline optimization problem is also solved by projecting the input data onto the principal subspace [46, 5, 20]. Remarkably, the optimization problem (3) can be converted algebraically to a tractable form by introducing variables $W$ and $M$ [31]:
$$\min_{\{y_t \in \mathbb{R}^m\}_{t=1}^{T}} \min_{W \in \mathbb{R}^{n \times m}} \max_{M \in \mathbb{R}^{m \times m}} \left[ \sum_{t=1}^{T} \left( -2\, x_t^\top W y_t + y_t^\top M y_t \right) + T\, \mathrm{Tr}(W^\top W) - \frac{T}{2}\, \mathrm{Tr}(M^\top M) \right]. \qquad (4)$$
In the online setting, first, we minimize (4) with respect to the output variables, $y_t$, by gradient descent while keeping $W$, $M$ fixed [30]:
$$\dot{y}_t = W^\top x_t - M y_t. \qquad (5)$$
To find $y_t$ after presenting the corresponding input, $x_t$, (5) is iterated until convergence. After the convergence of $y_t$, we update $W$ and $M$ by gradient descent and gradient ascent, respectively [30]:
$$W_{ij} \leftarrow W_{ij} + \eta\, (x_i y_j - W_{ij}), \qquad M_{ij} \leftarrow M_{ij} + \eta\, (y_i y_j - M_{ij}). \qquad (6)$$
The algorithm (5), (6) can be implemented by a biologically plausible NN, Fig. 1b. As before, the activity (firing rate) of the upstream neurons encodes the input variables, $x_t$. The output variables, $y_t$, are computed by the dynamics of activity (5) in a single layer of neurons. The elements of the matrices $W$ and $M$ are represented by the weights of synapses in feedforward and lateral connections, respectively. The learning rules (6) are local, i.e., the weight update, $\Delta W_{ij}$, for the synapse between the $i$th input neuron and the $j$th output neuron depends only on the activities, $x_i$, of the $i$th input neuron and, $y_j$, of the $j$th output neuron, and on the synaptic weight itself. The learning rules (6) for the synaptic weights $W$ and $-M$ (here the minus sign indicates inhibitory synapses, see Eq. (5)) are Hebbian and anti-Hebbian, respectively.
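The online algorithm (5)-(6) is compact enough to sketch in NumPy. The toy input statistics, learning rate, and iteration counts below are illustrative choices rather than the authors' settings; the point is that every weight update touches only quantities available at the corresponding synapse.

```python
import numpy as np

rng = np.random.default_rng(0)

def sm_step(x, W, M, lr=0.01, n_iter=100, dt=0.1):
    """One online similarity-matching step (Eqs. 5-6), updating W and M in place."""
    y = np.zeros(M.shape[0])
    for _ in range(n_iter):              # neural dynamics: dy/dt = W^T x - M y
        y += dt * (W.T @ x - M @ y)
    # Local learning: each weight update uses only the pre-/post-synaptic
    # activities and the weight itself -- Hebbian for W, anti-Hebbian for -M.
    W += lr * (np.outer(x, y) - W)
    M += lr * (np.outer(y, y) - M)
    return y

# Toy data with a dominant two-dimensional principal subspace (first two axes).
n, m = 10, 2
C = np.diag([5.0, 4.0] + [0.1] * (n - 2))
W = 0.1 * rng.standard_normal((n, m))
M = np.eye(m)
for x in rng.multivariate_normal(np.zeros(n), C, size=2000):
    y = sm_step(x, W, M)
```

After training, the feedforward weights approximately span the principal subspace of the inputs, here the first two coordinate axes.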
Figure 1: Single-layer NNs performing online (a) reconstruction error minimization (1) [24, 27], (b) similarity matching (SM) (3) [30], and (c) nonnegative similarity matching (NSM) (7) [29].

We now compare the objective functions of the two approaches. After dropping invariant terms, the reconstructive objective function has the following interactions among input and output variables: $-2\, x_t^\top W y_t + y_t^\top W^\top W y_t$ (Eq. (1)). The SM approach leads to $-2\, x_t^\top W y_t + y_t^\top M y_t$ (Eq. (4)). The term linear in $y_t$, a cross-term between inputs and outputs, $-2\, x_t^\top W y_t$, is common to both approaches and is responsible for projecting the data onto the principal subspace via the feedforward connections in Fig. 1ab. The terms quadratic in $y_t$ decorrelate different output channels via a competition implemented by the lateral connections in Fig. 1ab and differ between the two approaches. In particular, the inhibitory interaction between the neuronal activities $y_j$ in the reconstruction approach depends upon $W^\top W$, which is tied to the trained $W$ in a non-local way. In contrast, in the SM approach the inhibitory interaction matrix $M$ for the $y_j$'s is learned via a local anti-Hebbian rule.

The SM approach can be applied to other computational tasks, such as clustering and learning manifolds by tiling them with localized receptive fields [35]. To this end, we modify the offline optimization problem (3) by constraining the output, $y_t \in \mathbb{R}^m_+$, which then represents assignment indices (as, e.g., in the K-means algorithm):
$$\min_{\forall t \in \{1,\ldots,T\}:\ y_t \in \mathbb{R}^m_+} \sum_{t,t'=1}^{T} \left( x_t \cdot x_{t'} - y_t \cdot y_{t'} \right)^2. \qquad (7)$$
Such nonnegative SM (NSM), (7), just like the optimization problem (3), can be converted algebraically to a tractable form by introducing similar variables $W$ and $M$ [29].
The synaptic weight update rules presented in (6) remain unchanged, and the only difference between the online solutions of (3) and (7) is the dynamics of the neurons, which, instead of being linear, are now rectifying, Fig. 1c.

In the next section, we will address transformation learning. Similarly, we will review the reconstruction approach, identify the key term analogous to the cross-term $-2\, x_t^\top W y_t$, and then alter the objective function so that the cross-term is preserved but the inhibition between output neurons can be learned in a biologically plausible manner.

Now, we focus on learning to detect transformations from pairs of consecutive video frames, $x_t$ and $x_{t+1}$. We start with the observation that much of the change in pixel intensities in consecutive frames arises from a translation of the image. For infinitesimal translations, the pixel intensity change is given by a linear operator (or matrix), denoted by $A^a$, multiplying the vector of pixel intensities, scaled by the magnitude of translation, denoted by $\theta^a$. Because for a 2D image multiple directions of translation are possible, there is a set of translation matrices with corresponding magnitudes. Our goal is to learn the translation matrices from pairs of consecutive video frames and to compute the magnitudes of translation for each pair. Such a learning problem will reduce to the one discussed in the previous section, but performed on an unusual feature: the outer product of the pixel intensity variation and pixel intensity vectors.

We represent a video frame at time $t$ by the pixel intensity vector, $x_t$, formed by reshaping an image matrix into a vector. For infinitesimal transformations, the difference, $\Delta x_t$, between two consecutive frames, $x_t$ and $x_{t+1}$, is:
$$\Delta x_t = x_{t+1} - x_t = \sum_{a=1}^{K} \theta_t^a A^a x_t, \qquad \forall t \in \{1, \ldots, T-1\}, \qquad (8)$$
where, for each transformation $a \in \{1, \ldots, K\}$ between the frames $t$ and $t+1$, we define a transformation matrix $A^a$ and a magnitude of transformation, $\theta_t^a$. Whereas for image translation $A^a$ is known to implement a spatial derivative operator, we are interested in learning $A^a$ from data in an unsupervised fashion.

Previously, unsupervised algorithms for learning both $A^a$ and $\theta_t^a$ were derived by minimizing, with respect to $A^a$ and $\theta_t^a$, the squared prediction error [32], where the optimal $A^a$ and $\theta_t^a$ minimize the mismatch between the actual image and the one computed based on the learned model:
$$\sum_t \Big\| \Delta x_t - \sum_{a=1}^{K} \theta_t^a A^a x_t \Big\|^2 = \sum_t \Big[ \|\Delta x_t\|^2 - 2\, \Delta x_t^\top \sum_{a=1}^{K} \theta_t^a A^a x_t + \Big\| \sum_{a=1}^{K} \theta_t^a A^a x_t \Big\|^2 \Big]. \qquad (9)$$
Whereas solving (9) in the offline setting leads to reasonable estimates of $A^a$ and $\theta_t^a$ [32], it is rather non-biological. In a biologically plausible online setting, the data are streamed sequentially, and $\theta_t^a$ ($A^a$) must be computed (updated) with minimum latency. The algorithm can store only the latest pair of images and a small number of variables, i.e., a sufficient statistic, but not any significant part of the dataset. Although a sketch of a neural architecture was proposed in [32], it is clear from Section 1.1 that, due to the quadratic term in the output, $\theta_t^a$, a detailed architecture will suffer from the same non-locality as the reconstruction approach to latent variable NNs (1).

As the cross-term in (9) plays a key role in projecting the data (Section 1.1), we re-write it as follows:
$$\sum_t \Delta x_t^\top \sum_{a=1}^{K} \theta_t^a A^a x_t = \sum_{i,j,t,a} \Delta x_{t,i}\, \theta_t^a A^a_{i,j}\, x_{t,j} = \sum_{i,j,t,a} \theta_t^a A^a_{i,j}\, \Delta x_{t,i}\, x_{t,j} = \sum_t \Theta_t^\top A\, \mathrm{Vec}(\Delta x_t x_t^\top), \qquad (10)$$
where we introduced $A \in \mathbb{R}^{K \times n^2}$, the matrix whose rows are the vectorized versions of the generators, $A_{a,:} = \mathrm{Vec}(A^a)^\top$, $\forall a \in \{1, \ldots, K\}$, and $\Theta_t = (\theta_t^1, \ldots, \theta_t^K)^\top$, the vector whose components represent the magnitudes of the transformations at time $t$.

Eq. (10) shows that the cross-term favors aligning $A_{a,:}$ in the direction of the outer product of the pixel intensity variation and pixel intensity vectors, $\mathrm{Vec}(\Delta x\, x^\top)$. Although central to the learning of transformations in (9), the outer product of the pixel intensity variation and pixel intensity vectors was not explicitly highlighted in the transformation-operator learning approach [32, 10, 23].

2.2 Why the outer product of pixel intensity variation and pixel intensity vectors?

Here, we provide intuition for using outer products in the content-independent detection of translations. For simplicity, we consider 1D motion in a 1D world. Motion detection relies on a correspondence between consecutive video frames, $x_t$ and $x_{t+1}$.

One may think that such correspondences can be detected by a neuron adding up the responses of displaced filters applied to $x_t$ and $x_{t+1}$. While possible in principle, such a neuron's response would be highly dependent on the image content [21, 22]. This is because summing the outputs of the two filters amounts to applying an OR operation to them, which does not respond selectively to translation.

To avoid such dependence on the content, [21] proposed to invoke an AND operation, which is implemented by multiplication. Specifically, consider forming an outer product of $x_t$ and $x_{t+1}$ and summing its values along each diagonal. If the image is static, then the main diagonal produces the highest correlation. If the image is shifted by one pixel between the frames, then the first sub(super)-diagonal yields the highest correlation. If the image is shifted by two pixels, the second sub(super)-diagonal yields the highest correlation, and so on.
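This diagonal-summing scheme is easy to demonstrate directly; the frame length, shift, and normalization below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def diagonal_votes(x_prev, x_next, max_shift=2):
    """Sum the outer product x_prev x_next^T along its diagonals; with this
    convention the k-th superdiagonal wins when the image moved by k pixels."""
    P = np.outer(x_prev, x_next)
    # np.trace with an offset sums one (sub/super-)diagonal; normalize by length.
    return {k: np.trace(P, offset=k) / (len(x_prev) - abs(k))
            for k in range(-max_shift, max_shift + 1)}

x = rng.standard_normal(50)
votes = diagonal_votes(x, np.roll(x, 1))   # the whole image shifted by one pixel
best = max(votes, key=votes.get)           # -> 1
```

The winning diagonal sums squared pixel intensities, while all other diagonals sum products of uncorrelated pixels and stay near zero, so the vote is largely independent of the particular image content.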
Then, if the sum over each diagonal is represented by a different neuron, the velocity of the object is given by the most active neuron. Other models relying on multiplication are "mapping units" [15], "dynamic mappings" [43], and other bilinear models [26].

Our algorithm for motion detection adopts multiplication to detect correspondences but computes an outer product between the vectors of pixel intensity, $x_t$, and pixel intensity variation, $\Delta x_t$. Compared to the approach in [21], one advantage of our approach is that we do not require separate neurons to represent different velocities but rather have a single output neuron (for each direction of motion) whose activity increases with velocity. Previously, a similar outer product feature was proposed in [3] (for a formal connection, see Supplement A). Another advantage of our approach is the derivation from the principled SM objective motivated by the transformation-operator formalism.

Having identified the cross-term in (9) analogous to that in (1), we propose a novel objective function where the inhibition between output neurons is learned in a biologically plausible manner. By analogy with Eq. (3), we substitute the reconstruction-error-minimization objective with an SM objective for transformation learning. We denote the vectorized outer product between $\Delta x_t$ and $x_t$ as $\chi_t \in \mathbb{R}^{n^2}$:
$$\chi_{t,\alpha} = (\Delta x_t x_t^\top)_{i,j}, \quad \text{with } \alpha = (i-1)\,n + j. \qquad (11)$$
We concatenate these vectors into a matrix, $\chi \equiv [\chi_1, \ldots, \chi_T]$, as well as the transformation magnitude vectors, $\Theta \equiv [\Theta_1, \ldots, \Theta_T]$. Using these notations, we introduce the following SM objective:
$$\min_{\Theta \in \mathbb{R}^{K \times T}} \frac{1}{T^2} \| \chi^\top \chi - \Theta^\top \Theta \|_F^2 = \min_{\Theta_1, \ldots, \Theta_T} \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left( \chi_t^\top \chi_{t'} - \Theta_t^\top \Theta_{t'} \right)^2. \qquad (12)$$
To reconcile (9) and (12), we first show that the cross-terms are the same by introducing the following optimization over a matrix $W \in \mathbb{R}^{K \times n^2}$:
$$\frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \Theta_t^\top \Theta_{t'}\, \chi_t^\top \chi_{t'} = \frac{1}{T} \sum_{t=1}^{T} \Theta_t^\top \Big[ \frac{1}{T} \sum_{t'=1}^{T} \Theta_{t'} \chi_{t'}^\top \Big] \chi_t = \max_W \Big[ \frac{2}{T} \sum_{t=1}^{T} \Theta_t^\top W \chi_t - \mathrm{Tr}(W^\top W) \Big]. \qquad (13)$$
Therefore, the SM approach yields the cross-term, $\Theta_t^\top W \chi_t$, which is the same as $\Theta_t^\top A\, \mathrm{Vec}(\Delta x_t x_t^\top)$ in [32]. We can thus identify the rows $W_{a,:}$ with the vectorized transformation matrices, $\mathrm{Vec}(A^a)$, Fig. 2a. Solutions of (12) are known to be projections onto the principal subspace of $\chi$, the vectorized outer product of $\Delta x_t$ and $x_t$, which are equivalent, up to an orthogonal rotation, to PCA.

If we constrain the output to be nonnegative (NSM):
$$\min_{\Theta \in \mathbb{R}^{K \times T}_+} \frac{1}{T^2} \| \chi^\top \chi - \Theta^\top \Theta \|_F^2, \qquad (14)$$
then, by analogy with Sec. 1.1 [29], this objective function clusters the data or tiles data manifolds [35].

2.4 Online algorithm and NN

To derive online learning algorithms for (12) and (14), we follow the similarity matching approach [30]. The optimality condition of each online problem is given by [29, 30], for SM and NSM respectively:
$$\text{SM:}\ \ \Theta_t^* = W \chi_t - M \Theta_t^*; \qquad \text{NSM:}\ \ \Theta_t^* = \max(W \chi_t - M \Theta_t^*,\ 0), \qquad (15)$$
with $W$ and $M$ found using recursive formulations, $\forall a \in \{1, \ldots, K\}$, $\forall \alpha \in \{1, \ldots, n^2\}$:
$$W_{a\alpha} \leftarrow W_{a\alpha} + \Theta_{t-1,a} \left( \chi_{t-1,\alpha} - W_{a\alpha}\, \Theta_{t-1,a} \right) \big/ \hat{\Theta}_{t,a}, \qquad (16)$$
$$M_{a,a' \neq a} \leftarrow M_{aa'} + \Theta_{t-1,a} \left( \Theta_{t-1,a'} - M_{aa'}\, \Theta_{t-1,a} \right) \big/ \hat{\Theta}_{t,a}, \qquad (17)$$
$$\hat{\Theta}_{t,a} = \hat{\Theta}_{t-1,a} + (\Theta_{t-1,a})^2. \qquad (18)$$
This algorithm is similar to the model proposed in [30], but it is more difficult to implement in a biologically plausible way. This is because $\chi_t$ is an outer product of the input data and cannot be identified with the inputs to a single neuron.
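A minimal NumPy sketch of the online algorithm (15)-(18), in its SM form with a single transformation ($K = 1$, so the lateral term $M$ drops out). The subpixel-shift data generator and all hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5                                   # pixels; a single transformation (K = 1)

def shifted_pair(theta, rng):
    """A random white 1-D image and its infinitesimal translation:
    Delta x ~ theta * spatial derivative, with boundary values drawn from a
    slightly wider hidden image (an illustrative generator)."""
    z = rng.standard_normal(n + 2)
    return z[1:-1], theta * 0.5 * (z[2:] - z[:-2])

W = 0.01 * rng.standard_normal(n * n)   # one row of W, i.e. Vec(A) for a = 1
theta_hat = 1.0                         # running normalizer, Eq. (18)

for t in range(4000):
    theta = rng.uniform(-0.5, 0.5)
    x, dx = shifted_pair(theta, rng)
    chi = np.outer(dx, x).ravel()       # outer-product feature, Eq. (11)
    Theta = W @ chi                     # Eq. (15), SM with K = 1 (no lateral M)
    theta_hat += Theta ** 2             # Eq. (18)
    W += Theta * (chi - W * Theta) / theta_hat   # Eq. (16): local, Oja-like

# After training, the detector output should track the true translation
# magnitude on held-out frame pairs (up to an overall sign set by the init).
true, est = [], []
for _ in range(500):
    theta = rng.uniform(-0.5, 0.5)
    x, dx = shifted_pair(theta, rng)
    true.append(theta)
    est.append(W @ np.outer(dx, x).ravel())
corr = np.corrcoef(true, est)[0, 1]
```

Reshaping the learned `W` into an $n \times n$ matrix should reveal a derivative-like operator, in line with the results reported below.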
To implement this algorithm, we break up $W$ into rank-1 components, each of which is computed in a separate neuron, such that:
$$\Theta_{t,a}^* = \sum_i \Delta x_{t,i} \sum_j W_{ija}\, x_{t,j} - \sum_{a'} M_{aa'}\, \Theta_{t,a'}^*. \qquad (19)$$
Each element of the tensor $W_{ija}$ will be encoded in the weight of a feedforward synapse from the $j$-th pixel onto the $i$-th neuron encoding the $a$-th transformation (see Fig. 2a). Biologically plausible implementations of this algorithm are given in Section 3.

Here, we implement the biologically plausible algorithms presented in the previous subsection and report the learned transformation matrices. To validate the results of SM and NSM applied to the outer-product feature, $\chi$, we compare them with those of PCA and K-means, respectively, also applied to $\chi$, as formally defined in Supplement B. These standard but biologically implausible algorithms were chosen because they perform similar computations in the context of latent variable discovery.

The 1D visual world is represented by a continuous profile of light intensity as a function of one coordinate. A 1D eye measures light intensity in a 1D window consisting of $n$ discrete pixels. To imitate self-motion, such a window can move left and right by a fraction of a pixel at each time step. For the purpose of evaluating the proposed algorithms and derived NNs, we generated artificial training data by subjecting a randomly generated 1D image (Gaussian, exponentially correlated noise) to known horizontal subpixel translations. Then, we spatially whitened the discrete images using the ZCA whitening technique [2].

We start by learning $K = 2$ transformation matrices using each algorithm. After the rows of the synaptic weight matrix, $W$, are reshaped into $n \times n$ matrices, they can be identified with the transformation operators, $A^a$. The magnitude of the transformation is then given by $\Delta x_t^\top A^a x_t$, Fig. 2a.

SM and PCA.
The filters learned by SM are shown in Fig. 2c and those learned by PCA in Fig. 2e. The left panels of Fig. 2c,e represent the singular vectors capturing the maximum variance. They replicate the known operator of translation, a spatial derivative, found in [32]. The right panels of Fig. 2c,e show the singular vectors capturing the second largest variance, which do not account for a known transformation matrix. In the absence of a nonnegativity constraint, a reversal of translation is represented by a change of sign of the transformation magnitude.
NSM and K-means.
The filters learned by NSM are shown in Fig. 2d and those learned by K-means in Fig. 2f. They are similar to the first singular vector learned by SM, PCA, and [32]. However, in NSM and K-means the output must be nonnegative, so representing the opposite directions of motion requires two filters, which are sign inversions of each other.

For the various models, the rows of the learned operators, $A^a$, are identical except for a shift, i.e., the same operator is applied at each image location. As expected, the learned filters compute a spatial derivative of the pixel intensity, red rectangle in Fig. 2a.

Figure 2: The rows of the synaptic weight matrix $W_{a,:}$ are reshaped into $n \times n$ transformation matrices $A^a$. Then, the magnitude of the transformation is $\Delta x_t^\top A^a x_t$. Such a computation can be approximated by the cartoon model (b). Synaptic weights learned from 1D translation on a vector of 5 pixels by (c) SM, (d) NSM, (e) PCA (decreasing eigenvalues), and (f) K-means.

The learned weights can be approximated by a filter keeping only the three central pixels, Fig. 2b, which we name the cartoon model of the motion detector. It computes a correlation between the spatial derivative, denoted by $\Delta_i x_{t,i}$, and the temporal derivative, $\Delta_t x_t$. Such an algorithm may be viewed as a Bayesian optimal estimate of velocity in the low SNR regime (Supplement C), appropriate for the fly visual system [38]. The results presented in Fig. 2 were obtained with $n = 5$ pixels, but the same structure was observed with larger values of $n$.
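The three-pixel cartoon model amounts to multiplying the temporal derivative at the central pixel by the central-difference spatial derivative from its two neighbors. A sketch, with a drifting sine grating as an illustrative stimulus:

```python
import numpy as np

def cartoon_detector(prev, nxt):
    """Per-pixel motion signal: temporal derivative at pixel i times the
    central-difference spatial derivative from pixels i-1 and i+1."""
    dt = (nxt - prev)[1:-1]
    ds = 0.5 * (prev[2:] - prev[:-2])
    return dt * ds

x = np.arange(64)

def frame(phase):
    # A sine grating; decreasing phase moves the pattern rightward.
    return np.sin(2 * np.pi * x / 16 + phase)

right = cartoon_detector(frame(0.0), frame(-0.4)).sum()
left = cartoon_detector(frame(0.0), frame(+0.4)).sum()
# With this filter orientation, rightward drift yields a negative summed
# signal and leftward drift a positive one.
```

Reversing the motion direction flips the sign of the summed output, which is how a single output neuron per direction can report velocity.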
Similar results were also obtained with models trained on moving periodic sine-wave gratings, often used in fly experiments. We also trained our NN on motion in the four cardinal directions and on planar rotations of two-dimensional images, as was done in [32], and showed that our model can learn such transformations. Using NSM, we can again distinguish between motion in the four cardinal directions, and between clockwise and counterclockwise rotations, which was not possible with prior approaches (see Supplement D).

In this section, we propose two biologically plausible implementations of a motion detector by taking advantage of the decomposition of the outer product feature matrix into single-row components (19). The first implementation models computation in a mammalian neuron, such as a cortical pyramidal cell. The second models computation in a Drosophila motion-detecting neuron, T4 (the same arguments apply to T5). In the following, for simplicity, we focus on the cartoon model, Fig. 2b.
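The decomposition (19) that both implementations exploit can be verified numerically: summing per-branch products $\Delta x_i \cdot \sum_j W_{ija} x_{t,j}$ gives the same output as dotting a flattened row of $W$ with the outer-product feature $\chi$. The tensor layout and random values below are illustrative conventions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5, 2
x, dx = rng.standard_normal(n), rng.standard_normal(n)
W = rng.standard_normal((n, n, K))      # W[i, j, a]: synapse from pixel j on branch i
M = rng.random((K, K)) * (1 - np.eye(K))
Theta_prev = rng.standard_normal(K)

# Eq. (19): each branch i multiplies its shunting signal dx_i by a weighted sum
# of pixel intensities; the soma adds branches and subtracts lateral inhibition.
branch = np.einsum('ija,j->ia', W, x)
Theta = dx @ branch - M @ Theta_prev

# The same quantity via the flat feature chi = Vec(dx x^T), Eq. (11).
chi = np.outer(dx, x).ravel()
Theta_flat = chi @ W.reshape(n * n, K) - M @ Theta_prev
```

The two computations agree exactly, which is why the single-neuron (dendritic) and multi-neuron (fly column) implementations described next are interchangeable at the algorithmic level.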
Mammalian neurons can implement the motion computation by representing each row of the transformation matrix, $W$, in a different dendritic branch originating from the soma (cell body). Each such branch forms a compartment with its own membrane potential [14, 39], allowing it to perform its own nonlinear computation, the results of which are then summed in the soma. Each dendritic compartment receives the pixel intensity variation from only one pixel, via a proximal shunting inhibitory synapse [42, 16], and the pixel intensity vector via more distal synapses, Fig. 3a. We assume that the conductance of the shunting inhibitory synapse decreases with the variation in pixel intensity. The weights of the more distal synapses represent the corresponding row of the outer product feature matrix. When the variation in pixel intensity is low, the shunting inhibition vetoes the other post-synaptic currents. When the variation in pixel intensity is high, the shunting is absent and the remaining post-synaptic currents flow into the soma. A formal analysis shows that this operation can be viewed as a multiplication [42, 16]. Different compartments compute such products for the variation in intensity of different pixels, after which these products are summed in the soma (19), Fig. 3a.

The weight of a distal synapse is updated using a Hebbian learning rule applied to the corresponding pixel intensity, available pre-synaptically, and the transformation magnitude modulated by the shunting inhibition representing the pixel intensity variation, Fig. 3b. The transformation magnitude is computed in the soma and reaches the distal synapses via backpropagating dendritic spikes [40]. Such a backpropagating signal is modulated by the shunting inhibition, thus implementing the multiplication of the transformation magnitude and the pixel intensity variation (16), Fig. 3b. Competition between the neurons detecting motion in different directions is mediated by inhibitory interneurons [28].

In Fig. 3, the neural activity is computed according to (19) and the synaptic updates by dendritic backpropagation according to (16)-(18).
Figure 3: A multi-compartment model of a mammalian neuron. (a) Each dendrite multiplies the pixel intensity variation, signaled by the shunting inhibitory synapse, and the weighted vector of pixel intensities carried by more distal synapses. Products computed in each dendrite are summed in the soma to yield the transformation magnitude, encoded in the spike rate. (b) Synaptic weights are updated by the product of the corresponding pre-synaptic pixel intensities and the backpropagating spikes modulated by the shunting inhibition.
The Drosophila visual system comprises retinotopically organized layers of neurons, meaning that nearby columns process photoreceptor signals (identified with $x_i$ below) from nearby locations in the visual field. Unlike the implementation in the previous subsection, the motion computation is performed across multiple neurons. The local motion signal is first computed in each of the hundreds of T4 neurons that jointly tile the visual field. Their outputs are integrated by the downstream giant tangential neurons. Each T4 neuron receives the light intensity variation from only one pixel, via synapses from neurons Mi1 and Tm3, and light intensities from nearby pixels, via synapses from neurons Mi4 and Mi9 (with opposite signs) [41], Fig. 3c. Therefore, in each T4 neuron, $\Delta x$ is a scalar, $W$ is a vector, and the local motion velocity can be computed by a single-compartment neuron. If the weights of the synapses from Mi4 and Mi9 of different columns represent $W$, then the multiplication of $\Delta x$ and $W x$ can be accomplished, as before, using shunting inhibition. Competition among T4s detecting different directions of motion is implemented by inhibitory lateral connections.
Figure 4: An NN trained on 1D translations recapitulates the motion detection circuit in Drosophila. (a) Each motion-detecting neuron receives a pixel intensity variation signal from pixel $i$ and pixel intensity signals at least from pixels $i-1$ and $i+1$ (with opposite signs). (b) In Drosophila, each retinotopically organized column contains neurons Mi1/Tm3, Mi9, and Mi4 [41], which respond to light intensity in the corresponding pixel according to the impulse responses shown in (c) (from [1]). Each T4 neuron selectively samples different inputs from different columns [41]: it receives the light intensity variation via Mi1/Tm3 and the light intensity via Mi4 and Mi9 (with opposite signs).

Our model correlates inputs from at least three pixels, in agreement with recent experimental results [41, 1, 11, 34], instead of the two in the celebrated Hassenstein-Reichardt detector (HRD) [33]. In the fly, the outputs of T4s are summed over the visual field in downstream neurons. The summed output of our detectors is equivalent to the summed output of HRDs and is thus consistent with multiple prior behavioral experiments and physiological recordings from downstream neurons (see Supplement E). There is experimental evidence for nonlinear interactions of T4 inputs [34, 13], supporting a multiplicative model, but also for linear summation of inputs [11, 45]. Even if the summation is linear, the neuronal output nonlinearity can generate multiplicative terms for the outer product computation.

The main difference between our learned model (Fig. 2a) and most published models is that our motion detector is learned from data, using biologically plausible learning rules, in an unsupervised setting. Thus, our model can generate somewhat different receptive fields for different natural image statistics, such as those in the ON and OFF pathways, potentially accounting for the minor differences reported between the T4 and T5 circuits [41].

A recent model from [34] also uses three differently preprocessed inputs.
Unlike our model, which relies on a derivative computation in the middle pixel, the model in [34] is composed of a shared non-delay line flanked by two delay lines. As shown in Supplement E, after integration over the visual field, the global signal from our cartoon model (Fig. 2b) is equivalent to that from the HRD. The same observation has been made for the model in [34]. Yet the predicted output of a single motion detector in our model differs from both the HRD and [34].

Until recently, most experiments confirmed the predictions of the HRD model. However, almost all of these experiments measured either the activity of downstream giant neurons integrating T4 output over the whole visual field or the behavioral response generated by these giant neurons. Because, after integration over the visual field, the global signal from our cartoon model (Fig. 2b) is equivalent to that from the HRD, the various experimental confirmations of the HRD predictions are inherited by our model. Below, we list some of the confirmed predictions.
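For concreteness, the cartoon three-pixel detector of Fig. 2b, in which each unit multiplies the temporal derivative at pixel i by the spatial difference x_{i+1} − x_{i−1}, can be sketched in a few lines of NumPy. This is a minimal illustration of the computation, not the trained network; the function name and test stimulus are ours:

```python
import numpy as np

def three_pixel_emd(x_prev, x_curr):
    """Cartoon three-pixel motion detector (cf. Fig. 2b).

    Each detector i multiplies the temporal derivative at pixel i by the
    difference of its two neighbours, x_{i+1} - x_{i-1}, so its output
    depends on (at least) three adjacent pixels.
    """
    dt = x_curr - x_prev              # temporal derivative at every pixel
    di = x_curr[2:] - x_curr[:-2]     # x_{i+1} - x_{i-1} at interior pixels
    return dt[1:-1] * di              # one output per interior detector

# A 1D "image" translated by one pixel between consecutive frames:
x0 = np.array([0., 0., 1., 2., 1., 0., 0.])
right = three_pixel_emd(x0, np.roll(x0, 1)).sum()    # rightward shift
left = three_pixel_emd(x0, np.roll(x0, -1)).sum()    # leftward shift
# The summed output takes opposite signs for the two motion directions.
```

Opposite motion directions yield summed outputs of opposite sign, which is the signature of a direction-selective detector.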
Dependence of the output on the image contrast.
Because the HRD multiplies signals from the two photoreceptors, its output should be quadratic in the stimulus contrast. Similarly, in our model, the output should be proportional to contrast squared because it is given by the covariance between the temporal and spatial derivatives of the light intensity (Supplement C), each of which is proportional to contrast. Note that this prediction differs from [32], whose output is contrast-independent. Several experiments have confirmed these predictions in the low-SNR regime [12, 7, 9, 37, 4]. Of course, the output cannot grow unabated and, in the high-SNR regime, the output becomes contrast-independent. A likely cause is the signal normalization between the photoreceptors and T4 [12].

Oscillations in the motion signal locked to the visual stimulus.
In accordance with the oscillating output of the HRD in response to a moving periodic stimulus, physiological recordings have reported such phase-locked oscillations [6]. Our model reproduces these oscillations.
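A hedged numerical sketch of this prediction, using our own discretization of the cartoon detector (Fig. 2b) driven by a drifting sinusoidal grating: the single-detector output carries a constant component plus an oscillation phase-locked to the stimulus, at twice the stimulus temporal frequency (the product of two sinusoids at frequency ω contains a DC term and a 2ω term). All parameter values below are illustrative choices of ours:

```python
import numpy as np

# Drifting sinusoidal grating: wavelength 8 pixels, exactly 10 temporal
# cycles over 400 time steps.
k = 2 * np.pi / 8.0
omega = 2 * np.pi / 40.0
ts = np.arange(400)
x = lambda j, t: np.sin(k * j - omega * t)

# Single cartoon three-pixel detector at pixel i = 10, unit time delay:
i = 10
y = (x(i, ts) - x(i, ts - 1)) * (x(i + 1, ts) - x(i - 1, ts))

# Dominant AC frequency of the output: twice the stimulus frequency,
# i.e. 2 * 10 = 20 cycles per record.
spec = np.abs(np.fft.rfft(y - y.mean()))
peak_bin = int(spec.argmax())
```

The mean of `y` is nonzero (the direction-selective signal), while its AC part oscillates in lockstep with the passing grating.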
Dependence of the peak velocity on the wavelength.
In our model, just like in the HRD, the output first increases with the velocity of the visual stimulus and then decreases. The optimal velocity is proportional to the spatial wavelength of the visual stimulus because the temporal frequency of the optimal stimulus is a constant, given by the inverse of the time delay in one of the arms.
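Both the contrast-squared dependence and the velocity tuning can be checked numerically on the cartoon detector (Fig. 2b). The sketch below is our own discretization (unit time delay, sinusoidal gratings), so the specific parameter values are illustrative, not taken from the paper's experiments:

```python
import numpy as np

def mean_response(wavelength, velocity, contrast=1.0, n_t=8000):
    """Time-averaged output of one cartoon three-pixel detector for a
    drifting sinusoidal grating (unit time delay in the temporal
    derivative). Sign flipped so the preferred direction is positive."""
    k = 2 * np.pi / wavelength
    ts = np.arange(n_t)
    x = lambda j, t: contrast * np.sin(k * (j - velocity * t))
    y = (x(0, ts) - x(0, ts - 1)) * (x(1, ts) - x(-1, ts))
    return -y.mean()

# Output grows as contrast squared (both multiplied signals scale with it):
r1 = mean_response(8, 1.0, contrast=1.0)
r4 = mean_response(8, 1.0, contrast=2.0)   # ratio r4 / r1 is 4

# Output first rises then falls with velocity, and the optimal velocity
# scales with the wavelength (constant optimal temporal frequency):
vels = np.arange(0.5, 8.0, 0.5)
v_opt8 = vels[int(np.argmax([mean_response(8, v) for v in vels]))]
v_opt16 = vels[int(np.argmax([mean_response(16, v) for v in vels]))]
```

For this discretization the time-averaged response is approximately contrast² · sin(k) · sin(kv), so doubling the wavelength doubles the optimal velocity, as the text predicts.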
In conclusion, we learn transformation matrices using a similarity-preserving approach, leading to a biologically plausible model of a motion detector. Generalizing our work to the learning of other content-preserving transformations will open a path towards principled, biologically plausible object recognition.
Acknowledgments
We are grateful to P. Gunn and A. Genkin for discussion and comments on this manuscript. We thank D. Clark, J. Fitzgerald, E. Hunsicker, and B. Olshausen for helpful discussions.
References

[1] Alexander Arenz, Michael S Drews, Florian G Richter, Georg Ammer, and Alexander Borst. The temporal tuning of the Drosophila motion detectors is determined by the dynamics of their input elements. Current Biology, 27(7):929–944, 2017.
[2] Anthony J Bell and Terrence J Sejnowski. The "independent components" of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[3] Matthias Bethge, Sebastian Gerwinn, and Jakob H Macke. Unsupervised learning of a steerable basis for invariant image representations. In Human Vision and Electronic Imaging XII, volume 6492, page 64920C. International Society for Optics and Photonics, 2007.
[4] Erich Buchner. Elementary movement detectors in an insect visual system. Biological Cybernetics, 24(2):85–101, 1976.
[5] Trevor F Cox and Michael AA Cox. Multidimensional Scaling. Chapman and Hall/CRC, 2000.
[6] Martin Egelhaaf and Alexander Borst. A look into the cockpit of the fly: visual orientation, algorithms, and identified neurons. The Journal of Neuroscience, 13(11), 1993.
[7] G Fermi and W Reichardt. Optomotor reactions of the fly, Musca domestica. Dependence of the reaction on wave length, velocity, contrast and median brightness of periodically moved stimulus patterns. Kybernetik, 2:15–28, 1963.
[8] Peter Földiák. Learning invariance from transformation sequences. Neural Computation, 3(2):194–200, 1991.
[9] KG Götz. Optomotor studies of the visual system of several eye mutants of the fruit fly Drosophila. Kybernetik, 2(2):77, 1964.
[10] David B Grimes and Rajesh PN Rao. Bilinear sparse coding for invariant vision. Neural Computation, 17(1):47–73, 2005.
[11] Eyal Gruntman, Sandro Romani, and Michael B Reiser. Simple integration of fast excitation and offset, delayed inhibition computes directional selectivity in Drosophila. Nature Neuroscience, 21(2):250, 2018.
[12] Juergen Haag, Winfried Denk, and Alexander Borst. Fly motion vision is based on Reichardt detectors regardless of the signal-to-noise ratio. Proceedings of the National Academy of Sciences, 101(46):16333–16338, 2004.
[13] Juergen Haag, Abhishek Mishra, and Alexander Borst. A common directional tuning mechanism of Drosophila motion-sensing neurons in the ON and in the OFF pathway. eLife, 6:e29044, 2017.
[14] Michael Häusser and Bartlett Mel. Dendrites: bug or feature? Current Opinion in Neurobiology, 13(3):372–383, 2003.
[15] Geoffrey F Hinton. A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, Volume 2, pages 683–685. Morgan Kaufmann Publishers Inc., 1981.
[16] Christof Koch, Tomaso Poggio, and Vincent Torre. Nonlinear interactions in a dendritic tree: localization, timing, and role in information processing. Proceedings of the National Academy of Sciences, 80(9):2799–2802, 1983.
[17] Alexei A Koulakov and Dmitry Rinberg. Sparse incomplete representations: a potential role of olfactory granule cells. Neuron, 72(1):124–136, 2011.
[18] Holger G Krapp and Roland Hengstenberg. Estimation of self-motion by optic flow processing in single visual interneurons. Nature, 384(6608):463, 1996.
[19] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[20] KV Mardia, JT Kent, and JM Bibby. Multivariate Analysis. Academic Press, 1980.
[21] Roland Memisevic. Learning to relate images: Mapping units, complex cells and simultaneous eigenspaces. arXiv preprint arXiv:1110.0107, 2011.
[22] Roland Memisevic. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.
[23] Xu Miao and Rajesh PN Rao. Learning the Lie groups of visual invariance. Neural Computation, 19(10):2665–2693, 2007.
[24] E Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935, 1992.
[25] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.
[26] Bruno A Olshausen, Charles Cadieu, Jack Culpepper, and David K Warland. Bilinear models of natural images. In Human Vision and Electronic Imaging XII, volume 6492, page 649206. International Society for Optics and Photonics, 2007.
[27] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[28] Cengiz Pehlevan and Dmitri Chklovskii. A normative theory of adaptive dimensionality reduction in neural networks. In Advances in Neural Information Processing Systems, pages 2269–2277, 2015.
[29] Cengiz Pehlevan and Dmitri B Chklovskii. A Hebbian/anti-Hebbian network derived from online non-negative matrix factorization can cluster and discover sparse features. In , pages 769–775. IEEE, 2014.
[30] Cengiz Pehlevan, Tao Hu, and Dmitri B Chklovskii. A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Computation, 27(7):1461–1495, 2015.
[31] Cengiz Pehlevan, Anirvan M Sengupta, and Dmitri B Chklovskii. Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks? Neural Computation, 30(1):84–124, 2018.
[32] Rajesh PN Rao and Daniel L Ruderman. Learning Lie groups for invariant visual perception. In Advances in Neural Information Processing Systems, pages 810–816, 1999.
[33] Werner Reichardt. Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. Sensory Communication, pages 303–317, 1961.
[34] Emilio Salazar-Gatzimas, Margarida Agrochao, James E Fitzgerald, and Damon A Clark. The neuronal basis of an illusory motion percept is explained by decorrelation of parallel motion pathways. Current Biology, 28(23):3748–3762, 2018.
[35] Anirvan Sengupta, Cengiz Pehlevan, Mariano Tepper, Alexander Genkin, and Dmitri Chklovskii. Manifold-tiling localized receptive fields are optimal in similarity-preserving neural networks. In Advances in Neural Information Processing Systems, pages 7080–7090, 2018.
[36] Eero Peter Simoncelli. Distributed Representation and Analysis of Visual Motion. PhD thesis, Massachusetts Institute of Technology, 1993.
[37] Sandra Single and Alexander Borst. Dendritic integration and its role in computing image velocity. Science, 281(5384):1848–1850, 1998.
[38] Shiva R Sinha, William Bialek, and Rob R van Steveninck. Optimal local estimates of visual motion in a natural environment. arXiv preprint arXiv:1812.11878, 2018.
[39] Nelson Spruston. Pyramidal neurons: dendritic structure and synaptic integration. Nature Reviews Neuroscience, 9(3):206, 2008.
[40] Greg Stuart, Nelson Spruston, Bert Sakmann, and Michael Häusser. Action potential initiation and backpropagation in neurons of the mammalian CNS. Trends in Neurosciences, 20(3):125–131, 1997.
[41] Shin-ya Takemura, Aljoscha Nern, Dmitri B Chklovskii, Louis K Scheffer, Gerald M Rubin, and Ian A Meinertzhagen. The comprehensive connectome of a neural substrate for 'ON' motion detection in Drosophila. eLife, 6, 2017.
[42] V Torre and T Poggio. A synaptic mechanism possibly underlying directional selectivity to motion. Proceedings of the Royal Society of London. Series B. Biological Sciences, 202(1148):409–416, 1978.
[43] Christoph von der Malsburg. The correlation theory of brain function. In Models of Neural Networks, pages 95–119. Springer, 1994.
[44] Jimmy Wang, Jascha Sohl-Dickstein, and Bruno Olshausen. Unsupervised learning of Lie group operators from image sequences. In Frontiers in Systems Neuroscience. Conference Abstract: Computational and Systems Neuroscience, volume 1130, 2009.
[45] Carl FR Wienecke, Jonathan CS Leong, and Thomas R Clandinin. Linear summation underlies direction selectivity in Drosophila. Neuron, 99(4):680–688, 2018.
[46] Christopher KI Williams. On a connection between kernel PCA and metric multidimensional scaling. In Advances in Neural Information Processing Systems, pages 675–681, 2001.

Supplementary Materials
This is the supplementary material for the NeurIPS 2019 paper "A Similarity-preserving Neural Network Trained on Transformed Images Recapitulates Salient Features of the Fly Motion Detection Circuit" by Y. Bahroun, A. M. Sengupta, and D. B. Chklovskii.
A Alternative features for learning transformations
As described in Section 2.2 (main text), our algorithm for motion detection adopts multiplication to detect correspondences: it computes an outer product between the vector of pixel intensities, x_t, and the vector of pixel intensity variations, ∆x_t. Although an outer product feature was proposed in [21], our model most resembles that of [3]. We show in the following that, to first order, our model and that of [3] are related.

Let us consider a mid-point in time between t and t + 1, denoted by t + 1/2. Using the same approximation as before, x_{t+1/2} can be expressed as a function of either x_t or x_{t+1}:
$$ x_{t+1/2} = x_t + \tfrac{1}{2}\,\theta_t A^a x_t \tag{20} $$
$$ \phantom{x_{t+1/2}} = x_{t+1} - \tfrac{1}{2}\,\theta_t A^a x_{t+1} \tag{21} $$
By subtracting the two equations above we obtain
$$ x_{t+1} - x_t = \tfrac{1}{2}\,\theta_t A^a (x_{t+1} + x_t). \tag{22} $$
We can thus rephrase the reconstruction error as
$$ \min_{\theta_t,\, A^a} \sum_t \Big\| \Delta x_t - \tfrac{1}{2}\sum_{a=1}^{K} \theta^a_t A^a (x_{t+1} + x_t) \Big\|^2, \tag{23} $$
leading to the cross-term θ_t A^a (x_{t+1} + x_t) ∆x_t^⊤, which corresponds to the one proposed by [3]. An interesting observation regarding the outer product (x_{t+1} + x_t) ∆x_t^⊤ is that, unlike the one used in the main text, this feature has a time-reversal anti-symmetry. This helps detect the direction of change.

Another feature considered in [3] is
$$ x_{t+1} x_t^\top - x_t x_{t+1}^\top, \tag{24} $$
which exhibits an anti-symmetry both in time and in indices. The anti-symmetric property of the learnable transformation operators is expected to arise from the data rather than by construction. In fact, only elements of the algebra of SO(n) are anti-symmetric, which, in all generality, constitute only a subset of all possible transformations in GL(n). Nevertheless, we obtain again, to first order:
$$ x_{t+1} x_t^\top - x_t x_{t+1}^\top = (x_t + \theta A x_t) x_t^\top - x_t (x_t + \theta A x_t)^\top = \theta A x_t x_t^\top - (\theta A x_t x_t^\top)^\top. \tag{25} $$

B K-means and PCA for transformation learning
We proposed to evaluate the SM and NSM models for transformation learning against the associated K-means and PCA models for transformation learning. These models are respectively defined as the solutions of the following optimization problems:
$$ \min_{A \in \mathbb{R}^{K \times n}} \sum_t \| \chi_t - A^\top A \chi_t \|^2 \quad \text{s.t.} \quad A A^\top = I, \tag{26} $$
$$ \min_{\Theta \in \mathbb{R}^{K \times T}_+,\; A_{a,:}^\top \in \mathbb{R}^n} \sum_{t,a} \Theta_{t,a} \| \chi_t - A_{a,:}^\top \|^2 \quad \text{s.t.} \quad \sum_{a=1}^{K} \Theta_{t,a} = 1 \;\; \forall t \in \{1, \ldots, T\}. \tag{27} $$

C Detecting motion by correlating spatial and temporal derivatives
In the following we consider the approximation of the learned filters (Fig. 2a) by their cartoon version (Fig. 2b, main text). The magnitude of translation can be evaluated by solving a linear regression between the temporal, ∆_t x_t, and spatial, ∆_i x_t, derivatives of pixel intensities. Conventionally, this is done by minimizing the squared mismatch,
$$ \min_{\theta_t} \|\Delta_t x_t + \theta_t \Delta_i x_t\|^2 + \lambda \theta_t^2, \tag{28} $$
where the first term enforces object constancy [36] and the second term is a regularizer which may be thought of as arising from a Gaussian prior on the velocity estimator. By differentiating (28) with respect to θ_t and setting the derivative to zero, we find
$$ \theta_t = -\Delta_t x_t^\top \Delta_i x_t \,/\, (\|\Delta_i x_t\|^2 + \lambda) \approx -\Delta_t x_t^\top \Delta_i x_t \,/\, \lambda. \tag{29} $$
The latter approximation is justified by the fact that the regularizer λ dominates the denominator in the realistic setting [38]. This demonstrates that the magnitude of translation per frame is given by the normalized correlation of the spatial and temporal derivatives across the visual field.

The spatial gradient of pixel intensity, ∆_i x_{t,i}, at pixel i can be computed as the mean of the differences with the right and left nearest pixels [36]:
$$ \Delta_i x_{t,i} = \tfrac{1}{2}\big((x_{t,i+1} - x_{t,i}) + (x_{t,i} - x_{t,i-1})\big) = \tfrac{1}{2}(x_{t,i+1} - x_{t,i-1}) = [A x_t]_i, \tag{30} $$
where the matrix A matches the generator of translation learned by the various models above:
$$ A = \tfrac{1}{2}\begin{pmatrix} 0 & 1 & & & \\ -1 & 0 & 1 & & \\ & -1 & 0 & \ddots & \\ & & \ddots & \ddots & 1 \\ & & & -1 & 0 \end{pmatrix}. \tag{31} $$
Then the magnitude of translation, θ_t, is proportional to
$$ \Delta_t x_t^\top \Delta_i x_t = \Delta_t x_t^\top A x_t = \tfrac{1}{2}\sum_i \Delta_t x_{t,i}\,(x_{t,i+1} - x_{t,i-1}). \tag{32} $$

D Learning rotations and translations of 2D images
In addition to learning 1D translations of 1D images, SM and NSM can learn other types of transformations.
D.1 Planar rotations of 2D images
We applied our model to pairs of randomly generated two-dimensional images rotated by a small angle relative to each other. To this end, we first generated seed images with random pixel intensities. From each seed image, we generated a transformed image by applying small clockwise and counterclockwise rotations with different angles. We then presented these pairs to the algorithms. Again, we chose K = 2, and the models were evaluated against standard PCA and K-means as described in the main text, Section 3.

NSM applied to rotated pixel images learns the transformation matrices of clockwise and counterclockwise rotations, Fig. 5a and 5b. Similarly, K-means recovers both generators of rotations (results not shown). As before, SM and PCA recover only one rotation generator. Rao and Ruderman [32] could likewise account for only one direction of rotation, either clockwise or counterclockwise, as is the case for SM and PCA.

Figure 5: Learning and evaluation of rotations of 2D images. The filters learned by NSM (a) accounting for clockwise rotation are displayed as an array of weights that, for each pixel of ∆x_t, shows the strength of its connection to each of the pixels of x_t. Similarly for (b), the filters accounting for counterclockwise rotations. In (c), the learned filter accounting for clockwise rotation is applied multiple times to a diagonal bar (read left to right, top to bottom).

A naive evaluation of the accuracy of the learned filters was performed by applying a learned filter to a diagonal bar, as shown in Fig. 5c. After multiple applications of the rotation operator, artifacts start to appear.
Here we observe the limitations of using Lie algebra generators instead of Lie group elements (exponential maps), which would account for large transformations. Reshaping the filters shows that each component of the filter connects ∆x_{t,i,j} only to nearby x_{t,i±1,j±1}, and the connection between ∆x_{t,i,j} and x_{t,i,j} is absent, as before. This generalizes the three-pixel model to another type of transformation. It plausibly explains how the Drosophila circuit responsible for detecting roll, pitch, and yaw [18] is learned.

D.2 Translations of 2D images
In the case of 1D translations and 2D rotations there is only one generator of transformation (for sign-unconstrained output), which explains the choice of K = 2 for sign-constrained output. In the case of 2D images undergoing both horizontal and vertical motions, our model learns two different generators, for left-right and up-down motion (K = 4 for sign-constrained output). Fig. 6a-d show the filters learned by our model, each accounting for motion in one cardinal direction. These generators were also reported in [23].

Figure 6: Learning of translations of 2D images. The filters learned by NSM (a) accounting for right-to-left horizontal motion are displayed as an array of weights that, for each pixel of ∆x_t, shows the strength of its connection to each of the pixels of x_t. Similarly for (b) the filters accounting for left-to-right horizontal, (c) downward, and (d) upward motions.

E Equivalence of Global Motion Estimators
Using the expression for translation magnitude derived in the previous section, we can show that integrating the three-pixel model output over the visual field produces the same result as integrating the output of the popular Hassenstein-Reichardt detector (HRD) [33]. This is done by considering the discretized time derivative ∆_t x_t = x_t − x_{t−τ}, where τ plays the role of the time delay in the HRD. The following holds when the learned filters (Fig. 2a) are approximated by their cartoon version (Fig. 2b, main text).

Consider first the central pixel i for which the temporal derivative is taken. The output activity of our detector, y_i(t), is:
$$ y_i(t) = \big(x_i(t) - x_i(t-\tau)\big)\big(x_{i+1}(t) - x_{i-1}(t)\big) \tag{33} $$
$$ = x_i(t)\,x_{i+1}(t) - x_i(t)\,x_{i-1}(t) - x_i(t-\tau)\,x_{i+1}(t) + x_i(t-\tau)\,x_{i-1}(t). \tag{34} $$
Consider now pixel i + 1 as central. Then
$$ y_{i+1}(t) = \big(x_{i+1}(t) - x_{i+1}(t-\tau)\big)\big(x_{i+2}(t) - x_i(t)\big) = x_{i+1}(t)\,x_{i+2}(t) - x_{i+1}(t)\,x_i(t) - x_{i+1}(t-\tau)\,x_{i+2}(t) + x_{i+1}(t-\tau)\,x_i(t). $$
After adding y_i(t) and y_{i+1}(t), the shared terms depending on (t) but not on (t − τ), namely x_i(t) x_{i+1}(t) and −x_{i+1}(t) x_i(t), cancel each other. The terms with a dependence on (t − τ) can be combined to produce HR detectors, leading to
$$ y_i(t) + y_{i+1}(t) = -x_i(t)\,x_{i-1}(t) + x_{i+1}(t)\,x_{i+2}(t) + x_i(t-\tau)\,x_{i-1}(t) + x_{i+1}(t-\tau)\,x_i(t) - x_i(t-\tau)\,x_{i+1}(t) - x_{i+1}(t-\tau)\,x_{i+2}(t). $$
Denoting HR(i, t) = x_{i+1}(t − τ) x_i(t) − x_i(t − τ) x_{i+1}(t) and adding successive elements, one obtains:
$$ \sum_{i=-n}^{n} y_i(t) = \sum_{i=-n}^{n-1} \mathrm{HR}(i, t) + x_{-n}(t-\tau)\,x_{-n-1}(t) - x_n(t-\tau)\,x_{n+1}(t) - x_{-n}(t)\,x_{-n-1}(t) + x_n(t)\,x_{n+1}(t). \tag{35} $$
Since for large fields the boundary contributions vanish, this proves a formal equivalence between the proposed model and the HRD when summed over the pixels of the visual field. See the illustration in Fig. 7.

Figure 7: Equivalence between (a) the HRD and (b) the cartoon version of the learned model, after integration over the visual field.

Interestingly, expression (35) can be evaluated by the neural circuit suggested by fly connectomics. The visual field is tiled by EMD neurons, each EMD neuron i computing a product between ∆_t x_{i,t} and A_{i,:} x_t.
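The identity (35) is pure algebra and can be checked numerically for arbitrary signals. The sketch below (our own notation: `x_t` for the current frame, `x_d` for the delayed frame) draws random pixel values and verifies that the summed three-pixel detector outputs equal the summed HR correlators plus the four boundary terms:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
idx = range(-n - 1, n + 2)                # pixels -n-1, ..., n+1
x_t = {i: rng.normal() for i in idx}      # x_i(t), current frame
x_d = {i: rng.normal() for i in idx}      # x_i(t - tau), delayed frame

def y(i):   # three-pixel detector centered on pixel i
    return (x_t[i] - x_d[i]) * (x_t[i + 1] - x_t[i - 1])

def hr(i):  # Hassenstein-Reichardt correlator for the pixel pair (i, i+1)
    return x_d[i + 1] * x_t[i] - x_d[i] * x_t[i + 1]

lhs = sum(y(i) for i in range(-n, n + 1))
rhs = (sum(hr(i) for i in range(-n, n))
       + x_d[-n] * x_t[-n - 1] - x_d[n] * x_t[n + 1]   # delayed boundary terms
       - x_t[-n] * x_t[-n - 1] + x_t[n] * x_t[n + 1])  # instantaneous boundary terms
# lhs and rhs agree to machine precision.
```

For a large visual field, the four boundary terms become negligible relative to the sum over HR correlators, which is the equivalence claimed in the text.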