Enacted Visual Perception: A Computational Model based on Piaget Equilibrium
Aref Hakimzadeh, Yanbo Xue, and Peyman Setoodeh
Abstract—In Maurice Merleau-Ponty's phenomenology of perception, analysis of perception accounts for an element of intentionality, and in effect therefore, perception and action cannot be viewed as distinct procedures. In the same line of thinking, Alva Noë considers perception as a thoughtful activity that relies on capacities for action and thought. Here, by looking into psychology as a source of inspiration, we propose a computational model for the action involved in visual perception based on the notion of equilibrium as defined by Jean Piaget. In such a model, Piaget's equilibrium reflects the mind's status, which is used to control the observation process. The proposed model is built around a modified version of convolutional neural networks (CNNs) with enhanced filter performance, where characteristics of filters are adaptively adjusted via a high-level control signal that accounts for the thoughtful activity in perception. While the CNN plays the role of the visual system, the control signal is assumed to be a product of the mind.
Index Terms—Piaget equilibrium, schema theory, visual perception, convolutional neural network.
I. INTRODUCTION

Artificial intelligence (AI) can significantly benefit from the ongoing research in neuroscience and psychology. Looking into these fields as a source of inspiration will pave the way for building learning machines that mimic biological brains. The CNN, inspired by studies of the cat's visual cortex, can be viewed as a success story in following this line of thinking. In the past few years, the CNN and its variants have played key roles in building machine-vision systems. CNN-based architectures are built from a combination of layers that implement convolution, pooling, and nonlinear functions [1]. In the convolutional layers of a CNN, a set of fixed filters is usually used to scan the input data; hence, there is a lack of controllability over the filters [2]. Adaptively changing the filters at each layer (possibly independently of the other layers) will enhance the network's degree of plasticity even without changing the layer structure of the network. Deploying such a mechanism will pave the way for building effective attentive sensors. Suppose that we are watching a scene (e.g., a soccer game); sometimes we prefer to focus on a particular region or object (e.g., the ball in a soccer game), and sometimes we are interested in a wider view. The basic CNN variants do not provide such a degree of freedom unless they are equipped with an attention mechanism [3], [4] or a high-level controller
A. Hakimzadeh and P. Setoodeh are with the School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran (e-mail: [email protected]; [email protected]). Y. Xue is with the Career Science Lab, Beijing, China, and also with the Department of Control Engineering, Northeastern University at Qinhuangdao, China (e-mail: [email protected]).

to adjust the shape of the filters. If a CNN-based machine-vision algorithm is used in a more sophisticated system, such as a robot or an autonomous car, the high-level control signal for adjusting the CNN's filters can be provided by the entity that plays the brain's role and is superior to the visual system.

The human ability to learn from visual data has unique characteristics that distinguish it from state-of-the-art machine-vision systems. Zooming in and out are procedures that we perform every day. Most of the time, we do not scan our environment precisely until something raises our curiosity. Biological visual systems use a fixed number of photoreceptors in the retina to convert light into nerve impulses that are transmitted to the higher layers of the visual cortex. While the main neural structure involved in vision (in the sense of the number of photoreceptors) remains unchanged for different manners of looking, visual perception occurs through moving the eyes around (saccades) and zooming. Hence, visual perception can be better understood as an enacted perception [5], [6]. Here, we aim at designing a system that is able to mimic human adaptive visual behavior despite structural constraints. A control signal is used to determine how a scene should be screened or what part of the scene should be focused on. For instance, a control variable is needed to determine the manner of focusing on a certain object or a specific area to detect movements or anomalies. To address this issue, we look into psychology as a source of inspiration to define a proper mind status that can be interpreted as the required control signal [7].
In what follows, more details will be provided on implementing the control mechanism for adjusting filters, followed by proposing a criterion for equilibrium. Then, the proposed model will be validated on a real dataset. The rest of the paper is organized as follows. Section II reviews Piaget's schema theory and the equilibrium concept. Section III covers the filter design for the CNN. Section IV provides simulation results. Section V presents the concluding remarks.

II. PIAGET'S EQUILIBRIUM
As mentioned previously, in order to have a precise definition of the control signal for guiding the visual system, it would be helpful to use Piaget's equilibrium. First, we recall the definitions of schema, assimilation, accommodation, and equilibration from psychology [8]:
• Schemas refer to the basic building blocks of cognition that make it possible to form a mental representation of the environment.
• Assimilation refers to the similarity between the existing schemas and a new situation encountered.
• Accommodation refers to the elements of a new situation that are either not contained in the existing schemas or contradict them.
• Equilibration refers to the balance between assimilation and accommodation. Having the intention to perform a task or find a solution to a problem calls for assimilation of information to partially match the individual's mental schemas, as well as accommodation of information by modifying the individual's way of thinking to adopt it. Therefore, problem solving can be studied under an equilibration criterion [9].
The notion of equilibration can be adopted in CNN-based machine vision systems, where a control signal is needed to reflect the mind's status. In this framework, for filters with fixed numbers of cells, the control signal will be responsible for the arrangement of the cells' topology. Such filters need not be fully connected, which means that there could be gaps between a filter's cells; to be more precise, some of the filter's cells could be null cells. According to Piaget's definition of equilibrium, as the distance from equilibrium increases in a non-equilibrium condition, the filter cells should be more condensed and the filter should sweep the image faster until the distance decreases. Then, in a close-to-equilibrium condition, a sparser filter that sweeps the image slower would work. The sweeping speed of the filter can be adjusted by its stride.
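As a toy illustration of this control rule, the mapping from the distance-from-equilibrium value to a filter's spatial spread and stride might be sketched as follows. The function name, window sizes, and linear schedule are illustrative assumptions, not the paper's implementation:

```python
def filter_schedule(distance, max_span=7, min_span=3, max_stride=4):
    """Map a distance-from-equilibrium value in [0, 1] to a filter layout.

    distance = 0 (stable mind): cells spread over a wide window (a sparse
    filter with many null cells), swept slowly with stride 1.
    distance = 1 (unstable mind): cells condensed into a tight window,
    swept quickly with the maximum stride.
    """
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must lie in [0, 1]")
    # Window side shrinks linearly from max_span to min_span as distance grows,
    # so the fixed number of active cells becomes more condensed.
    span = round(max_span - (max_span - min_span) * distance)
    # Stride grows linearly from 1 to max_stride, so sweeping speeds up.
    stride = round(1 + (max_stride - 1) * distance)
    return span, stride
```

Under these assumed parameters, `filter_schedule(0.0)` yields a sparse 7-cell-wide, stride-1 layout, while `filter_schedule(1.0)` yields a condensed 3-cell-wide, stride-4 layout.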
The measure of distance from equilibrium can be considered as a value between zero and one, where zero reflects a completely stable mind and one means a totally unstable mind. Such mechanisms can be embedded into the CNN architecture using floating discrete controllable filters, as described in the following section.

III. FILTER REALIZATION
The flowchart of the training procedure for a CNN with floating discrete filters (FDFs) is depicted in Fig. 1. Two approaches are proposed for filter implementation, using mathematical functions and neural networks:
• Mathematical function: A family of spiral-shaped functions with adjustable interim and boundaries can be used for filter realization. In polar coordinates, such filters are mathematically described as f(θ) = R(θ)e^{jθ}, where the radius R is a function of the phase θ ∈ [0, kπ], and k is chosen according to the step size used for discretization as well as the total number of the filter's cells. Choosing the form of the function R(θ) provides a degree of freedom in the design process. One option would be R(θ) = θ^{αn}β, where n ∈ (−1, 1) denotes the scaled equilibrium value, α ∈ (0, 1] is a hyper-parameter, and β is a normalizing coefficient. While for negative values of n the filter cells are more concentrated around the center, for positive values of n they are more scattered. Hence, adjusting the value of n may be viewed as taking a countermeasure to compensate for the distance from equilibrium. Fig. 2 shows a number of filters plotted for different values of n with α = β = 1. It should be noted that, in general, α and β will have different values.
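Under the stated form of R(θ), the cell locations of such a spiral filter can be generated numerically. The sketch below is our illustration (the function name and the rounding-to-grid step are assumptions); it shows how negative n pulls cells toward the center while positive n scatters them outward:

```python
import math

def spiral_filter_cells(n, n_cells=9, k=4, alpha=1.0, beta=1.0):
    """Sample the spiral f(theta) = R(theta) * e^{j theta}, with
    R(theta) = beta * theta**(alpha * n), at n_cells phases in (0, k*pi],
    rounding each point to integer grid offsets (row, col).
    theta = 0 is skipped, since 0**(alpha*n) is undefined for negative n.
    """
    cells = []
    for i in range(1, n_cells + 1):
        theta = k * math.pi * i / n_cells
        r = beta * theta ** (alpha * n)
        cells.append((round(r * math.cos(theta)), round(r * math.sin(theta))))
    return cells
```

With these assumed defaults, n = −0.5 collapses every cell into the 3×3 neighbourhood of the center, while n = 0.5 spreads cells several pixels outward, mirroring the trend in Fig. 2.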
Fig. 1: Training procedure for a CNN with floating discrete controllable filters.

• Neural network: A multilayer perceptron (MLP) [10], [11] can be trained for filter realization. In each computation step, the MLP receives the equilibrium value as its input and returns the locations of 25 percent of the cells in the filter, with their corresponding values, as its output. The other 75 percent of the cells are considered null cells.

One common technique in machine learning is feature encoding. Therefore, it would be informative to investigate the proposed algorithm from the perspective of hierarchical data abstraction. Although the number of cells in a filter may be constant during visual data processing with floating discrete filters, at each step, the information stored in the weights has a different interpretation. Every equilibrium value leads to a different filter shape and, therefore, a different representation of the image in the form of filter weights. As the distance from equilibrium decreases and filters with more scattered shapes can be used, the weights of the network will provide a more abstract representation of the image. In different applications, every set of weights, which corresponds to an equilibrium value, can be separately stored in memory and used for decoding data at various levels of abstraction. A relational learning scheme can be deployed for learning a bidirectional association between different equilibria and filter shapes. Using such a memory-augmented version of the algorithm [12], discrete floating filters can be used for encoding visual data, and equilibrium values provide the key for decoding data.

Fig. 2: Filter shapes f(θ) = θ^{αn}β e^{jθ} for different values of n (n = −0.4 to 0.5) with α = β = 1. As the distance from equilibrium increases, the filter becomes more concentrated around the center.

To expedite the learning process, we can start with a pre-trained network using the most scattered filter possible. If the results are not satisfying, the distance from equilibrium increases and the filter's shape is changed accordingly. This characteristic allows for using the proposed architecture for anomaly detection. Moreover, it paves the way for deriving attentive algorithms. In a recognition system, in the last layer before the softmax layer of the CNN, we can determine the best part of the data to pay attention to. The problem of augmenting the algorithm with an attention mechanism can be divided into two major sub-problems:
• Class-based attention: According to [13], attention can be implemented using the pooling operation in the last layer of the CNN. Class activation mapping uses the weights connected to the nodes associated with the determined classes in the output layer.
Hence, in the last layer of the CNN, the dimension can be matched to the input dimension using an up-sampling mechanism. The same approach can be adopted by the proposed algorithm, followed by re-scanning the highlighted part with a more condensed filter. This process resembles how humans find a region of interest in a scene and then focus on it.
• Attention-based anomaly detection: The most relevant algorithm to this idea is Grad-CAM [14]. In a pre-trained network, a change that occurs in the input increases the gradient in the last layer with respect to the previous knowledge stored in the memory. Back-propagating these gradients would cause some nodes of the last layer of the CNN to become more influential than others. By up-sampling the output to match the size of the input, the area that caused an increase in the distance from equilibrium, or the part containing an abnormality, can be determined. Then, a decision can be reached using more condensed filters on the specified area. This is similar to class-based attention, with the difference that here, the gradient of the softmax layer is tracked by the system instead of the weights connected to the specific output node.

IV. EXPERIMENTS
Floating discrete filters can be designed by solving the following bi-level optimization problem:

    maximize O                                      (1)
    subject to:  minimize_W  Σ_i (y_{di} − y_i)
                 subject to:  equilibrium > c

where O denotes the order of the current filter, y_d and y are the desired output and the network output scanned by a filter of order O, respectively, c is a constant that determines the threshold for the equilibrium value (i.e., a measure of distance from the equilibrium condition), and W refers to the network parameters. This problem can be solved in an iterative manner: starting with a value for the manipulated variable of the leader problem, the follower problem is solved; then, fixing the manipulated variable of the follower problem, the leader problem is solved to find a better value for its own manipulated variable. In this way, both manipulated variables converge to their optimal values. To design an optimal filter, we start with the most scattered structure, which corresponds to the highest possible order of the filter. If the proposed order satisfies the follower problem, the process ends; otherwise, the order is decreased and the follower problem is checked again, until both levels of the problem are satisfied.

The CIFAR-10 benchmark dataset was chosen to evaluate the proposed architecture, where 60%, 20%, and 20% of the data points were selected as the training, validation, and test sets, respectively. In order to implement the idea of floating discrete filters, the notion of Piaget's equilibrium must be defined mathematically. For instance, equilibrium can be defined as a function of the variance of the network output, such that a higher variance represents a larger distance from equilibrium. Three sets of experiments were performed. In the first experiment, fulfilling the required conditions, the deployed floating discrete filters were hand-crafted for each discrete equilibrium value. In this scenario, the kernel design step in the flowchart of Fig. 1 can be bypassed.
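Both confidence measures used in the experiments (variance here, entropy in the third experiment) can be read as simple statistics of the output layer. The sketch below is one possible reading under that assumption; the function names are ours:

```python
import math

def output_variance(probs):
    """Variance of the output vector: 0 for a uniform 10-class output and
    0.09 for a one-hot output, bracketing the 0.05-0.1 threshold range
    used in Tables I-IV."""
    m = sum(probs) / len(probs)
    return sum((p - m) ** 2 for p in probs) / len(probs)

def output_entropy(probs):
    """Shannon entropy of the output: higher entropy means a less confident
    prediction, which the third experiment maps to a lower equilibrium value."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

For a 10-class softmax output, a one-hot (fully confident) prediction gives the maximum variance of 0.09 and zero entropy, while a uniform (fully uncertain) prediction gives zero variance and the maximum entropy of ln 10.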
In the second experiment, filters were designed using the mathematical functions described in the previous section for given equilibrium values.

These experiments were performed as follows. In the first experiment, a network was trained with normal filters, and the trained network's performance was evaluated on the test set. In the second experiment, equilibrium was initially set to the highest possible value (one). When the network makes a prediction using the designed filters, if the variance of the output layer is acceptable, the process terminates; otherwise, the equilibrium value is decreased. Decreasing the equilibrium value results in more concentrated filters. Then, the network output is calculated based on the new filters, and this procedure continues until the network can provide a trustworthy answer, which can be interpreted as reaching the desired threshold for variance. Due to discretization, there exist only nine possible equilibrium configurations and filter shapes; hence, there will be nine possible steps in the FDF design process.

The following four criteria are considered to compare the performance of different architectures:
• Criterion 1: percentage of correct answers predicted by FDFs (algorithm precision)
• Criterion 2: normal filter prediction, which is available as a candidate during the FDF design process
• Criterion 3: the first and last predicted answers of a single FDF process are the same
• Criterion 4: mean value of the number of times that the filter shape was changed during the process (mean number of steps per instance)

In the first experiment, three networks with different architectures were trained. For each network, six different thresholds were considered for the variance, which determines the stopping criterion, and for each threshold, the four criteria mentioned above were calculated as percentages.
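The stepwise procedure above can be sketched as a single loop. All callables are illustrative placeholders, and the nine discrete equilibrium values are taken from the n values shown in Fig. 2:

```python
def fdf_predict(predict_with_filter, design_filter, is_trustworthy,
                equilibria=(0.5, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, -0.3, -0.4)):
    """Run the FDF inference loop: start from the most scattered filter
    (largest equilibrium value) and move toward more concentrated filters
    until the prediction is trustworthy. Returns (output, steps); the step
    count is what criterion 4 averages over the test instances.
    """
    output, steps = None, 0
    for e in equilibria:                  # nine discrete configurations
        kernel = design_filter(e)         # spiral function, MLP, or handcrafted
        output = predict_with_filter(kernel)
        steps += 1
        if is_trustworthy(output):        # e.g. output variance past threshold
            break
    return output, steps
```

With stub callables, the loop stops at the first configuration whose output passes the trust test; if none passes, all nine steps are consumed, matching the value 9 in the last rows of Tables I-IV.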
The first architecture has the following characteristics:
• First layer: a convolutional layer with a 3×3 filter size and one output filter
• Second layer: a fully connected layer with 1024 neurons
• Output layer: a layer with 10 neurons
Results are summarized in Table I. The precision of this network with the normal filter shape was 43.99%. The number 9 appearing in the last row for criterion 4 shows that uncertainty was present until the designed filter took a shape similar to the normal filter. This condition can be interpreted as reaching a zero value for equilibrium. The first column of the table shows a precision close to that of the normal filter, while the mean number of steps per instance is near one. This means that using a filter that is twice as large as the normal one, and that reduces the computational burden in each layer by up to 50%, could achieve a precision close to that of the typical process. Considering only the first output of the FDF process and ignoring the rest, which is equivalent to designing the filter with an equilibrium value of one, leads to similar predictions for both the FDF and the normal filter 83.81% of the time. This confirms the generality of the designed filters regardless of the process functionality, which is reflected in criterion 3.

The second architecture used in the first experiment has the following characteristics:
• First layer: a convolutional layer with a 3×3 filter size and three output filters
• Second layer: a fully connected layer with 1024 neurons
• Output layer: a layer with 10 neurons
Results are summarized in Table II. The precision of this network with the normal filter shape was 52.47%. In this table, the second column has the benefits of near-normal-filter precision and a small mean number of steps, which confirms reduced computational effort while achieving nearly the same level of performance.
Regarding generalization, by considering just the first-stage prediction of the FDF, 72.47% of the time the predictions would be the same as those of the normal filter.

The third architecture used for the first experiment has the following characteristics:
• First layer: a convolutional layer with a 3×3 filter size and eight output filters
• Second layer: a fully connected layer with 1024 neurons
• Output layer: a layer with 10 neurons
Results are summarized in Table III. The precision of this network with the normal filter shape was 57.9%. Similar to the previous two cases, for this network, the reduced computational burden leads to faster performance.

For the second experiment, an architecture similar to the first one was used, and a kernel was designed based on the mathematical functions described in the previous section. The following architecture was used for this experiment:
• First layer: a convolutional layer with a 3×3 filter size and one output filter
• Second layer: a fully connected layer with 1024 neurons
• Output layer: a layer with 10 neurons
Results are summarized in Table IV. By comparing Tables I and IV, which correspond to the same architecture, we can conclude that the mathematical filter design approach shows promising results and achieves normal-filter precision with less computational effort. The third row of Table IV shows the generality of this approach, which can be used for data compression or even abstraction without loss of information.

Results show that increasing the complexity of the network increases the gap in performance between architectures that use FDFs and normal filters. This was expected, because the FDF process was designed to improve the generalization ability of CNNs by keeping a simple structure and providing flexibility to the filters, while maintaining an acceptable level of performance. However, increasing complexity runs against the main idea behind the FDF design process. Improving the performance of the FDF must be achieved through a better definition of the equilibrium value.
In this section, a simple interpretation of equilibrium in terms of variance was implemented to show the effectiveness of the proposed method.

In the third experiment, the entropy of the output layer was used instead of the variance to represent equilibrium, such that a higher entropy indicates a less confident prediction and hence a lower equilibrium value [15], [16]. Two case studies were investigated, with network architectures and configurations similar to the first case of the first experiment and to the second experiment, except for using entropy instead of variance. Results are summarized in Tables V and VI, respectively. Networks can achieve better precision and faster processing in this case. In addition to the improvement in processing time, using entropy to represent equilibrium improved precision even beyond what was achievable by the CNN with normal filters using variance (Tables V and VI cover a wider range of equilibrium thresholds compared to Tables I to IV). Therefore, entropy can provide a valid stopping criterion during the
TABLE I: Results of experiment 1, case 1.
Variance Threshold   0.05   0.07   0.08   0.088  0.09   0.1
Criterion 1          42.06  41.53  41.23  40.69  43.99  43.99
Criterion 2          84.60  85.31  85.86  87.06  99.56  99.56
Criterion 3          95.54  91.43  88.74  83.68  83.93  83.93
Criterion 4          1.089  1.227  1.356  1.703  9      9
TABLE II: Results of experiment 1, case 2.
Variance Threshold   0.05   0.07   0.08   0.088  0.09   0.1
Criterion 1          34.38  48.95  48.57  48.98  49.42  52.74
Criterion 2          41.12  76.46  78.02  81.96  86.86  99.49
Criterion 3          90.03  86.22  83.1   79.06  76.09  72.37
Criterion 4          1.36   1.756  2.42   3.412  4.714  9
TABLE III: Results of experiment 1, case 3.
Variance Threshold   0.05   0.07   0.08   0.088  0.09   0.1
Criterion 1          51.23  50.34  49.7   49.02  57.9   57.9
Criterion 2          75.0   76.19  76.99  79.19  99.44  99.44
Criterion 3          94.13  89.56  86.54  81.28  73.51  73.51
Criterion 4          1.185  1.448  1.719  2.465  9      9
TABLE IV: Results of experiment 2.
Variance Threshold   0.05   0.07   0.08   0.088  0.09   0.1
Criterion 1          43.71  43.61  43.62  43.45  43.37  43.99
Criterion 2          82.25  83.26  84.03  85.88  89.39  99.32
Criterion 3          95.7   92.42  90.59  87.02  83.43  81.16
Criterion 4          1.126  1.333  1.543  2.12   3.32   9
TABLE V: Results of experiment 3, case 1.
Entropy Threshold    10e-11  0.001  0.005  0.02   0.2    0.3
Criterion 1          45.0    44.3   43.7   42.9   40.9   41.8
Criterion 2          80.8    81.1   81.4   82.7   84.7   89.5
Criterion 3          100     97.9   96.1   90.2   82.0   80.4
Criterion 4          1.03    2.02   2.41   3.04   3.88   5.021
TABLE VI: Results of experiment 3, case 2.
Entropy Threshold    10e-11  0.001  0.005  0.02   0.2    0.3
Criterion 1          42.6    41.12  39.3   38.3   38.5   40.0
Criterion 2          84.45   84.45  83.5   83.9   85.6   90.0
Criterion 3          99.75   95.87  90.5   83.1   77.7   77.8
Criterion 4          1.053   1.47   2.04   2.59   3.45   4.38

training phase.

Although the two proposed mathematical representations of equilibrium were fairly simple, the experiments showed the effectiveness of the proposed approach for improving the flexibility of CNNs. Hence, it can be viewed as a step towards designing a proper high-level controller for mimicking the human visual system.

V. CONCLUDING REMARKS
The performance of a machine vision system can be enhanced by rescanning data using a subset of filters after the answer is known, in order to fine-tune some of the filter weights. In this way, the system can recognize the picture with fewer iterations and a wider filter. Moreover, using the outcome of the dense layers to adjust the metric for the distance from equilibrium may improve the performance. As an attempt to provide a computational model for Piaget's original definition of equilibrium, two mathematical measures were deployed in the implementations and their results were compared. Implementing different control mechanisms in a machine vision system with any existing network architecture is now possible through different definitions of the equilibrium value. This paves the way for using any CNN variant as a follower optimizer for another higher-level system, such as an artificial mind.

This paper proposed a computational model for enacted visual perception based on CNNs that deploy floating discrete controllable filters. The design method proposed a cognitive computing architecture that integrates perception, memory, and attention. Deploying a high-level control signal that reflects the mind's status allows the proposed architecture to closely mimic biological vision systems. Moreover, the proposed visual data processing system can be easily integrated into more sophisticated systems, where an entity plays the role of the brain. However, training this system may take a longer time to converge, and its accuracy may be slightly worse than that of equivalent CNNs with fixed filters. Considering both advantages and disadvantages, the proposed architecture is a potential candidate for the machine vision systems deployed in robotic projects aimed at mimicking humans.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[2] J. Wu, "Introduction to convolutional neural networks," National Key Lab for Novel Software Technology, Nanjing University, China, vol. 5, p. 23, 2017.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[4] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3286–3295.
[5] M. Merleau-Ponty, Phenomenology of Perception. Motilal Banarsidass Publishers, 1996.
[6] A. Noë, Action in Perception. MIT Press, 2004.
[7] M. Raab, Cognition and Intelligence: Identifying the Mechanisms of the Mind. Cambridge University Press, 2005.
[8] J. Piaget and B. Inhelder, The Psychology of the Child. Basic Books, 2008.
[9] D. G. Singer and T. A. Revenson, A Piaget Primer: How a Child Thinks. ERIC, 1997.
[10] S. Haykin, Neural Networks and Learning Machines, 3rd ed. Pearson, 2009.
[11] K. Gurney, An Introduction to Neural Networks. CRC Press, 1997.
[12] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou et al., "Hybrid computing using a neural network with dynamic external memory," Nature, vol. 538, no. 7626, pp. 471–476, 2016.
[13] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[14] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[15] S. Haykin, M. Fatemi, P. Setoodeh, and Y. Xue, "Cognitive control," Proceedings of the IEEE, vol. 100, no. 12, pp. 3156–3169, 2012.
[16] M. Fatemi, P. Setoodeh, and S. Haykin, "Observability of stochastic complex networks under the supervision of cognitive dynamic systems,"