Enacted Visual Perception: A Computational Model based on Piaget Equilibrium
Aref Hakimzadeh, Yanbo Xue, and Peyman Setoodeh
Abstract—In Maurice Merleau-Ponty's phenomenology of perception, analysis of perception accounts for an element of intentionality, and in effect therefore, perception and action cannot be viewed as distinct procedures. In the same line of thinking, Alva Noë considers perception as a thoughtful activity that relies on capacities for action and thought. Here, by looking into psychology as a source of inspiration, we propose a computational model for the action involved in visual perception based on the notion of equilibrium as defined by Jean Piaget. In such a model, Piaget's equilibrium reflects the mind's status, which is used to control the observation process. The proposed model is built around a modified version of convolutional neural networks (CNNs) with enhanced filter performance, where characteristics of filters are adaptively adjusted via a high-level control signal that accounts for the thoughtful activity in perception. While the CNN plays the role of the visual system, the control signal is assumed to be a product of the mind.
Index Terms—Piaget equilibrium, schema theory, visual perception, convolutional neural network.
I. INTRODUCTION

Artificial intelligence (AI) can significantly benefit from the ongoing research in neuroscience and psychology. Looking into these fields as a source of inspiration will pave the way for building learning machines that mimic biological brains. The CNN, inspired by studies of the cat's visual cortex, can be viewed as a success story in following this line of thinking. In the past few years, the CNN and its variants have played key roles in building machine-vision systems. CNN-based architectures are built from a combination of layers that implement convolution, pooling, and nonlinear functions [1]. In the convolutional layers of a CNN, a set of fixed filters is usually used to scan the input data; hence, there is a lack of controllability over the filters [2]. Adaptively changing the filters at each layer (possibly independently of the other layers) will enhance the network's degree of plasticity even without changing the layer structure of the network. Deploying such a mechanism will pave the way for building effective attentive sensors. Suppose that we are watching a scene (e.g., a soccer game); sometimes we prefer to focus on a particular region or object (e.g., the ball in a soccer game), and sometimes we are interested in a wider view. The basic CNN variants do not provide such a degree of freedom unless they are equipped with an attention mechanism [3], [4] or a high-level controller
A. Hakimzadeh and P. Setoodeh are with the School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran (e-mail: [email protected]; [email protected]). Y. Xue is with the Career Science Lab, Beijing, China, and also with the Department of Control Engineering, Northeastern University at Qinhuangdao, China (e-mail: [email protected]).

to adjust the shape of the filters. If a CNN-based machine-vision algorithm is used in a more sophisticated system, such as a robot or an autonomous car, the high-level control signal for adjusting the CNN's filters can be provided by the entity that plays the brain's role and is superior to the visual system.

The human ability to learn from visual data has unique characteristics that distinguish it from state-of-the-art machine-vision systems. Zooming in and out are procedures that we perform every day. Most of the time, we do not scan our environment precisely until something raises our curiosity. Biological visual systems use a fixed number of photoreceptors in the retina to convert light into nerve impulses that are transmitted to the higher layers of the visual cortex. While the main neural structure involved in vision (in the sense of the number of photoreceptors) remains unchanged for different manners of looking, visual perception occurs through moving the eyes around (saccades) and zooming. Hence, visual perception can be better understood as an enacted perception [5], [6]. Here, we aim at designing a system that is able to mimic human adaptive visual behavior despite structural constraints. A control signal is used to determine how a scene should be screened or what part of the scene should be focused on. For instance, a control variable is needed to determine the manner of focusing on a certain object or a specific area to detect movements or anomalies. To address this issue, we look into psychology as a source of inspiration to define a proper mind status that can be interpreted as the required control signal [7].
In what follows, more details will be provided on implementing the control mechanism for adjusting filters, followed by proposing a criterion for equilibrium. Then, the proposed model will be validated on a real dataset. The rest of the paper is organized as follows. Section II reviews Piaget's schema theory and the equilibrium concept. Section III covers the filter design for the CNN. Section IV provides simulation results. Section V presents the concluding remarks.

II. PIAGET'S EQUILIBRIUM
As mentioned previously, in order to have a precise definition of the control signal for guiding the visual system, it would be helpful to use Piaget's equilibrium. First, we recall the definitions of schema, assimilation, accommodation, and equilibration from psychology [8]:
• Schemas refer to the basic building blocks of cognition that make it possible to form a mental representation of the environment.
• Assimilation refers to the similarity between the existing schemas and a new situation encountered.
• Accommodation refers to the elements of a new situation that are either not contained in the existing schemas or contradict them.
• Equilibration refers to the balance between assimilation and accommodation. Having the intention to perform a task or find a solution to a problem calls for assimilation of information to partially match the individual's mental schemas, as well as accommodation of information by modifying the individual's way of thinking to adopt it. Therefore, problem solving can be studied under an equilibration criterion [9].
The notion of equilibration can be adopted in CNN-based machine vision systems, where a control signal is needed to reflect the mind's status. In this framework, for filters with fixed numbers of cells, the control signal will be responsible for the arrangement of the cells' topology. Such filters need not be fully connected, which means that there could be gaps between a filter's cells; to be more precise, some of the filter's cells could be null cells. According to Piaget's definition of equilibrium, as the distance from equilibrium increases in a non-equilibrium condition, the filter cells should be more condensed and the filter should sweep the image faster until the distance decreases. Then, in a close-to-equilibrium condition, a sparser filter that sweeps the image slower would work. The sweeping speed of the filter can be adjusted by its stride.
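As a toy illustration of this control rule, the mapping from the distance-from-equilibrium value to a filter's spatial spread and stride might be sketched as follows. The function name, window sizes, and linear schedule are illustrative assumptions, not the paper's implementation:

```python
def filter_schedule(distance, max_span=7, min_span=3, max_stride=4):
    """Map a distance-from-equilibrium value in [0, 1] to a filter layout.

    distance = 0 (stable mind): cells spread over a wide window (a sparse
    filter with many null cells), swept slowly with stride 1.
    distance = 1 (unstable mind): cells condensed into a tight window,
    swept quickly with the maximum stride.
    """
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must lie in [0, 1]")
    # Window side shrinks linearly from max_span to min_span as distance grows,
    # so the fixed number of active cells becomes more condensed.
    span = round(max_span - (max_span - min_span) * distance)
    # Stride grows linearly from 1 to max_stride, so sweeping speeds up.
    stride = round(1 + (max_stride - 1) * distance)
    return span, stride
```

Under these assumed parameters, `filter_schedule(0.0)` yields a sparse 7-cell-wide, stride-1 layout, while `filter_schedule(1.0)` yields a condensed 3-cell-wide, stride-4 layout.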
The measure of distance from equilibrium can be considered as a value between zero and one, where zero reflects a completely stable mind and one means a totally unstable mind. Such mechanisms can be embedded into the CNN architecture using floating discrete controllable filters, as described in the following section.

III. FILTER REALIZATION
The flowchart of the training procedure for a CNN with floating discrete filters (FDFs) is depicted in Fig. 1. Two approaches are proposed for filter implementation, using mathematical functions and neural networks:
• Mathematical function: A family of spiral-shaped functions with adjustable interim and boundaries can be used for filter realization. In polar coordinates, such filters are mathematically described as f(θ) = R(θ)e^{jθ}, where the radius R is a function of the phase θ ∈ [0, kπ], and k is chosen according to the step size used for discretization as well as the total number of the filter's cells. Choosing the form of the function R(θ) provides a degree of freedom in the design process. One option would be R(θ) = θ^{αn}β, where n ∈ (−1, 1) denotes the scaled equilibrium value, α ∈ (0, 1] is a hyper-parameter, and β is a normalizing coefficient. While for negative values of n the filter cells are more concentrated around the center, for positive values of n they are more scattered. Hence, adjusting the value of n may be viewed as taking a countermeasure to compensate for the distance from equilibrium. Fig. 2 shows a number of filters plotted for different values of n with α = β = 1. It should be noted that, in general, α and β will have different values.
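Under the stated form of R(θ), the cell locations of such a spiral filter can be generated numerically. The sketch below is our illustration (the function name and the rounding-to-grid step are assumptions); it shows how negative n pulls cells toward the center while positive n scatters them outward:

```python
import math

def spiral_filter_cells(n, n_cells=9, k=4, alpha=1.0, beta=1.0):
    """Sample the spiral f(theta) = R(theta) * e^{j theta}, with
    R(theta) = beta * theta**(alpha * n), at n_cells phases in (0, k*pi],
    rounding each point to integer grid offsets (row, col).
    theta = 0 is skipped, since 0**(alpha*n) is undefined for negative n.
    """
    cells = []
    for i in range(1, n_cells + 1):
        theta = k * math.pi * i / n_cells
        r = beta * theta ** (alpha * n)
        cells.append((round(r * math.cos(theta)), round(r * math.sin(theta))))
    return cells
```

With these assumed defaults, n = −0.5 collapses every cell into the 3×3 neighbourhood of the center, while n = 0.5 spreads cells several pixels outward, mirroring the trend in Fig. 2.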
Fig. 1: Training procedure for a CNN with floating discrete controllable filters.

• Neural network: A multilayer perceptron (MLP) [10], [11] can be trained for filter realization. In each computation step, the MLP receives the equilibrium value as its input and returns the locations of 25 percent of the cells in the filter, with their corresponding values, as its output. The other 75 percent of the cells are considered null cells.

One common technique in machine learning is feature encoding. Therefore, it would be informative to investigate the proposed algorithm from the perspective of hierarchical data abstraction. Although the number of cells in a filter may be constant during visual data processing with floating discrete filters, at each step, the information stored in the weights has a different interpretation. Every equilibrium value leads to a different filter shape and, therefore, a different representation of the image in the form of filter weights. As the distance from equilibrium decreases and filters with more scattered shapes can be used, the weights of the network will provide a more abstract representation of the image. In different applications, every set of weights, which corresponds to an equilibrium value, can be separately stored in memory and used for decoding data at various levels of abstraction. A relational learning scheme can be deployed for learning a bidirectional association between different equilibria and filter shapes. Using such a memory-augmented version of the algorithm [12], discrete floating filters can be used for encoding visual data, and equilibrium values provide the key for decoding data.

Fig. 2: Filter shapes f(θ) = θ^{αn}β e^{jθ} for different values of n (n = −0.4 to 0.5) with α = β = 1. As the distance from equilibrium increases, the filter becomes more concentrated around the center.

To expedite the learning process, we can start with a pre-trained network using the most scattered filter possible. If the results are not satisfying, the distance from equilibrium increases and the filter's shape is changed accordingly. This characteristic allows for using the proposed architecture for anomaly detection. Moreover, it paves the way for deriving attentive algorithms. In a recognition system, in the last layer before the softmax layer of the CNN, we can determine the best part of the data to pay attention to. The problem of augmenting the algorithm with an attention mechanism can be divided into two major sub-problems:
• Class-based attention: According to [13], attention can be implemented using the pooling operation in the last layer of the CNN. Class activation mapping uses the weights connected to the nodes associated with the determined classes in the output layer.
Hence, in the last layer of the CNN, the dimension can be matched to the input dimension using an up-sampling mechanism. The same approach can be adopted by the proposed algorithm, followed by re-scanning the highlighted part with a more condensed filter. This process resembles how humans find a region of interest in a scene and then focus on it.
• Attention-based anomaly detection: The most relevant algorithm to this idea is Grad-CAM [14]. In a pre-trained network, a change that occurs in the input increases the gradient in the last layer with respect to the previous knowledge stored in the memory. Back-propagating these gradients would cause some nodes of the last layer of the CNN to become more influential than others. By up-sampling the output to match the size of the input, the area that caused an increase in the distance from equilibrium, or the part containing an abnormality, can be determined. Then, a decision can be reached using more condensed filters on the specified area. This is similar to class-based attention, with the difference that here, the gradient of the softmax layer is tracked by the system instead of the weights connected to the specific output node.

IV. EXPERIMENTS
Floating discrete filters can be designed by solving the following bi-level optimization problem:

    maximize O                                      (1)
    subject to:  minimize_W  Σ_i (y_{di} − y_i)
                 subject to:  equilibrium > c

where O denotes the order of the current filter, y_d and y are the desired output and the network output scanned by a filter of order O, respectively, c is a constant that determines the threshold for the equilibrium value (i.e., a measure of distance from the equilibrium condition), and W refers to the network parameters. This problem can be solved in an iterative manner: starting with a value for the manipulated variable of the leader problem, the follower problem is solved; then, fixing the manipulated variable of the follower problem, the leader problem is solved to find a better value for its own manipulated variable. In this way, both manipulated variables converge to their optimal values. To design an optimal filter, we start with the most scattered structure, which corresponds to the highest possible order of the filter. If the proposed order satisfies the follower problem, the process ends; otherwise, the order is decreased and the follower problem is checked again, until both levels of the problem are satisfied.

The CIFAR-10 benchmark dataset was chosen to evaluate the proposed architecture, where 60%, 20%, and 20% of the data points were selected as the training, validation, and test sets, respectively. In order to implement the idea of floating discrete filters, the notion of Piaget's equilibrium must be defined mathematically. For instance, equilibrium can be defined as a function of the variance of the network output, such that a higher variance represents a larger distance from equilibrium. Three sets of experiments were performed. In the first experiment, fulfilling the required conditions, the deployed floating discrete filters were hand-crafted for each discrete equilibrium value. In this scenario, the kernel design step in the flowchart of Fig. 1 can be bypassed.
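Both confidence measures used in the experiments (variance here, entropy in the third experiment) can be read as simple statistics of the output layer. The sketch below is one possible reading under that assumption; the function names are ours:

```python
import math

def output_variance(probs):
    """Variance of the output vector: 0 for a uniform 10-class output and
    0.09 for a one-hot output, bracketing the 0.05-0.1 threshold range
    used in Tables I-IV."""
    m = sum(probs) / len(probs)
    return sum((p - m) ** 2 for p in probs) / len(probs)

def output_entropy(probs):
    """Shannon entropy of the output: higher entropy means a less confident
    prediction, which the third experiment maps to a lower equilibrium value."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

For a 10-class softmax output, a one-hot (fully confident) prediction gives the maximum variance of 0.09 and zero entropy, while a uniform (fully uncertain) prediction gives zero variance and the maximum entropy of ln 10.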
In the second experiment, filters were designed using the mathematical functions described in the previous section for given equilibrium values.

These experiments were performed as follows. In the first experiment, a network was trained with normal filters, and the trained network's performance was evaluated on the test set. In the second experiment, equilibrium was initially set to the highest possible value (one). When the network makes a prediction using the designed filters, if the variance of the output layer is acceptable, the process terminates; otherwise, the equilibrium value is decreased. Decreasing the equilibrium value results in more concentrated filters. Then, the network output is calculated based on the new filters, and this procedure continues until the network can provide a trustworthy answer, which can be interpreted as reaching the desired threshold for variance. Due to discretization, there exist only nine possible equilibrium configurations and filter shapes; hence, there will be nine possible steps in the FDF design process.

The following four criteria are considered to compare the performance of different architectures:
• Criterion 1: percentage of correct answers predicted by FDFs (algorithm precision)
• Criterion 2: normal filter prediction, which is available as a candidate during the FDF design process
• Criterion 3: the first and last predicted answers of a single FDF process are the same
• Criterion 4: mean value of the number of times that the filter shape was changed during the process (mean number of steps per instance)

In the first experiment, three networks with different architectures were trained. For each network, six different thresholds were considered for the variance, which determines the stopping criterion, and for each threshold, the four criteria mentioned above were calculated as percentages.
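The stepwise procedure above can be sketched as a single loop. All callables are illustrative placeholders, and the nine discrete equilibrium values are taken from the n values shown in Fig. 2:

```python
def fdf_predict(predict_with_filter, design_filter, is_trustworthy,
                equilibria=(0.5, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, -0.3, -0.4)):
    """Run the FDF inference loop: start from the most scattered filter
    (largest equilibrium value) and move toward more concentrated filters
    until the prediction is trustworthy. Returns (output, steps); the step
    count is what criterion 4 averages over the test instances.
    """
    output, steps = None, 0
    for e in equilibria:                  # nine discrete configurations
        kernel = design_filter(e)         # spiral function, MLP, or handcrafted
        output = predict_with_filter(kernel)
        steps += 1
        if is_trustworthy(output):        # e.g. output variance past threshold
            break
    return output, steps
```

With stub callables, the loop stops at the first configuration whose output passes the trust test; if none passes, all nine steps are consumed, matching the value 9 in the last rows of Tables I-IV.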
The first architecture has the following characteristics:
• First layer: a convolutional layer with a 3×3 filter size and one output filter
• Second layer: a fully connected layer with 1024 neurons
• Output layer: a layer with 10 neurons
Results are summarized in Table I. The precision of this network with the normal filter shape was 43.99%. The number 9 appearing in the last row for criterion 4 shows that uncertainty was present until the designed filter took a shape similar to the normal filter. This condition can be interpreted as reaching a zero value for equilibrium. The first column of the table shows a precision close to that of the normal filter, while the mean number of steps per instance is near one. This means that using a filter that is twice as large as the normal one, and that reduces the computational burden in each layer by up to 50%, could achieve a precision close to that of the typical process. Considering only the first output of the FDF process and ignoring the rest, which is equivalent to designing the filter with an equilibrium value of one, leads to similar predictions for both the FDF and the normal filter 83.81% of the time. This confirms the generality of the designed filters regardless of the process functionality, which is reflected in criterion 3.

The second architecture used in the first experiment has the following characteristics:
• First layer: a convolutional layer with a 3×3 filter size and three output filters
• Second layer: a fully connected layer with 1024 neurons
• Output layer: a layer with 10 neurons
Results are summarized in Table II. The precision of this network with the normal filter shape was 52.47%. In this table, the second column has the benefits of near-normal-filter precision and a small mean number of steps, which confirms reduced computational effort while achieving nearly the same level of performance.
Regarding generalization, by considering just the first-stage prediction of the FDF, 72.47% of the time the predictions would be the same as those of the normal filter.

The third architecture used for the first experiment has the following characteristics:
• First layer: a convolutional layer with a 3×3 filter size and eight output filters
• Second layer: a fully connected layer with 1024 neurons
• Output layer: a layer with 10 neurons
Results are summarized in Table III. The precision of this network with the normal filter shape was 57.9%. Similar to the previous two cases, for this network, the reduced computational burden leads to faster performance.

For the second experiment, an architecture similar to the first one was used, and a kernel was designed based on the mathematical functions described in the previous section. The following architecture was used for this experiment:
• First layer: a convolutional layer with a 3×3 filter size and one output filter
• Second layer: a fully connected layer with 1024 neurons
• Output layer: a layer with 10 neurons
Results are summarized in Table IV. By comparing Tables I and IV, which correspond to the same architecture, we can conclude that the mathematical filter design approach shows promising results and achieves normal-filter precision with less computational effort. The third row of Table IV shows the generality of this approach, which can be used for data compression or even abstraction without loss of information.

Results show that increasing the complexity of the network increases the gap in performance between architectures that use FDFs and normal filters. This was expected, because the FDF process was designed to improve the generalization ability of CNNs by keeping a simple structure and providing flexibility to the filters, while maintaining an acceptable level of performance. However, increasing complexity runs against the main idea behind the FDF design process. Improving the performance of the FDF must be achieved through a better definition of the equilibrium value.
In this section, a simple interpretation of equilibrium in terms of variance was implemented to show the effectiveness of the proposed method.

In the third experiment, the entropy of the output layer was used instead of the variance to represent equilibrium, such that a higher entropy indicates a less confident prediction and hence a lower equilibrium value [15], [16]. Two case studies were investigated, with network architectures and configurations similar to the first case of the first experiment and to the second experiment, except for using entropy instead of variance. Results are summarized in Tables V and VI, respectively. Networks can achieve better precision and faster processing in this case. In addition to the improvement in processing time, using entropy to represent equilibrium improved precision even beyond what was achievable by the CNN with normal filters using variance (Tables V and VI cover a wider range of equilibrium thresholds compared to Tables I to IV). Therefore, entropy can provide a valid stopping criterion during the
TABLE I: Results of experiment 1, case 1.
Variance Threshold   0.05   0.07   0.08   0.088  0.09   0.1
Criterion 1          42.06  41.53  41.23  40.69  43.99  43.99
Criterion 2          84.60  85.31  85.86  87.06  99.56  99.56
Criterion 3          95.54  91.43  88.74  83.68  83.93  83.93
Criterion 4          1.089  1.227  1.356  1.703  9      9
TABLE II: Results of experiment 1, case 2.
Variance Threshold   0.05   0.07   0.08   0.088  0.09   0.1
Criterion 1          34.38  48.95  48.57  48.98  49.42  52.74
Criterion 2          41.12  76.46  78.02  81.96  86.86  99.49
Criterion 3          90.03  86.22  83.1   79.06  76.09  72.37
Criterion 4          1.36   1.756  2.42   3.412  4.714  9
TABLE III: Results of experiment 1, case 3.
Variance Threshold   0.05   0.07   0.08   0.088  0.09   0.1
Criterion 1          51.23  50.34  49.7   49.02  57.9   57.9
Criterion 2          75.0   76.19  76.99  79.19  99.44  99.44
Criterion 3          94.13  89.56  86.54  81.28  73.51  73.51
Criterion 4          1.185  1.448  1.719  2.465  9      9
TABLE IV: Results of experiment 2.
Variance Threshold   0.05   0.07   0.08   0.088  0.09   0.1
Criterion 1          43.71  43.61  43.62  43.45  43.37  43.99
Criterion 2          82.25  83.26  84.03  85.88  89.39  99.32
Criterion 3          95.7   92.42  90.59  87.02  83.43  81.16
Criterion 4          1.126  1.333  1.543  2.12   3.32   9
TABLE V: Results of experiment 3, case 1.
Entropy Threshold    10e-11  0.001  0.005  0.02   0.2    0.3
Criterion 1          45.0    44.3   43.7   42.9   40.9   41.8
Criterion 2          80.8    81.1   81.4   82.7   84.7   89.5
Criterion 3          100     97.9   96.1   90.2   82.0   80.4
Criterion 4          1.03    2.02   2.41   3.04   3.88   5.021
TABLE VI: Results of experiment 3, case 2.
Entropy Threshold    10e-11  0.001  0.005  0.02   0.2    0.3
Criterion 1          42.6    41.12  39.3   38.3   38.5   40.0
Criterion 2          84.45   84.45  83.5   83.9   85.6   90.0
Criterion 3          99.75   95.87  90.5   83.1   77.7   77.8
Criterion 4          1.053   1.47   2.04   2.59   3.45   4.38

training phase.

Although the two proposed mathematical representations of equilibrium were fairly simple, the experiments showed the effectiveness of the proposed approach for improving the flexibility of CNNs. Hence, it can be viewed as a step towards designing a proper high-level controller for mimicking the human visual system.

V. CONCLUDING REMARKS
The performance of a machine vision system can be enhanced by rescanning data using a subset of filters after the answer is known, in order to fine-tune some of the filter weights. In this way, the system can recognize the picture with fewer iterations and a wider filter. Moreover, using the outcome of the dense layers to adjust the metric for the distance from equilibrium may improve the performance. As an attempt to provide a computational model for Piaget's original definition of equilibrium, two mathematical measures were deployed in the implementations and their results were compared. Implementing different control mechanisms in a machine vision system with any existing network architecture is now possible through different definitions of the equilibrium value. This paves the way for using any CNN variant as a follower optimizer for another higher-level system, such as an artificial mind.

This paper proposed a computational model for enacted visual perception based on CNNs that deploy floating discrete controllable filters. The design method proposed a cognitive computing architecture that integrates perception, memory, and attention. Deploying a high-level control signal that reflects the mind's status allows the proposed architecture to closely mimic biological vision systems. Moreover, the proposed visual data processing system can be easily integrated into more sophisticated systems, where an entity plays the role of the brain. However, training this system may take a longer time to converge, and its accuracy may be slightly worse than that of equivalent CNNs with fixed filters. Considering both advantages and disadvantages, the proposed architecture is a potential candidate for the machine vision systems deployed in robotic projects aimed at mimicking humans.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[2] J. Wu, "Introduction to convolutional neural networks," National Key Lab for Novel Software Technology, Nanjing University, China, vol. 5, p. 23, 2017.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[4] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3286–3295.
[5] M. Merleau-Ponty, Phenomenology of Perception. Motilal Banarsidass Publishers, 1996.
[6] A. Noë, Action in Perception. MIT Press, 2004.
[7] M. Raab, Cognition and Intelligence: Identifying the Mechanisms of the Mind. Cambridge University Press, 2005.
[8] J. Piaget and B. Inhelder, The Psychology of the Child. Basic Books, 2008.
[9] D. G. Singer and T. A. Revenson, A Piaget Primer: How a Child Thinks. ERIC, 1997.
[10] S. Haykin, Neural Networks and Learning Machines, 3rd ed. Pearson, 2009.
[11] K. Gurney, An Introduction to Neural Networks. CRC Press, 1997.
[12] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou et al., "Hybrid computing using a neural network with dynamic external memory," Nature, vol. 538, no. 7626, pp. 471–476, 2016.
[13] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[14] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[15] S. Haykin, M. Fatemi, P. Setoodeh, and Y. Xue, "Cognitive control," Proceedings of the IEEE, vol. 100, no. 12, pp. 3156–3169, 2012.
[16] M. Fatemi, P. Setoodeh, and S. Haykin, "Observability of stochastic complex networks under the supervision of cognitive dynamic systems,"