Arguments for the Unsuitability of Convolutional Neural Networks for Non-Local Tasks
Sebastian Stabinger, David Peer, Antonio Rodríguez-Sánchez
Universität Innsbruck, Technikerstrasse 21a, 6020 Innsbruck, Austria
Abstract
Convolutional neural networks have established themselves over the past years as the state-of-the-art method for image classification, and for many datasets, they even surpass humans in categorizing images. Unfortunately, the same architectures perform much worse when they have to compare parts of an image to each other to correctly classify this image. Until now, no well-formed theoretical argument has been presented to explain this deficiency. In this paper, we will argue that convolutional layers are of little use for such problems, since comparison tasks are global by nature, but convolutional layers are local by design. We will use this insight to reformulate a comparison task into a sorting task and use findings on sorting networks to propose a lower bound for the number of parameters a neural network needs to solve comparison tasks in a generalizable way. We will use this lower bound to argue that attention, as well as iterative/recurrent processing, is needed to prevent a combinatorial explosion.
Keywords: convolutional neural networks, sorting networks, relational reasoning, attention, locality
1. Introduction
Being able to compare objects in a scene and making decisions based on that information is an essential skill for humans, who can compare completely novel shapes and objects without being familiar with them, something which, for example, Fleuret et al. (2011) were able to show using the SVRT dataset. Since 2012, deep learning and Convolutional Neural Networks (CNNs) have emerged as state-of-the-art methods in computer vision. It is therefore natural to extend the use of such networks to tasks involving judgments about similarity and identity. Unfortunately, experiments by us (Stabinger et al. (2016)) have shown that CNNs perform very poorly on classification tasks that require comparison of shapes.

Figure 1: Examples of the two classes of problem 1 from the SVRT dataset by Fleuret et al. (2011). For class 1 the two shapes are different, for class 2 they are identical.

Although multiple authors have confirmed this problem of CNNs by now (e.g. Ricci et al. (2018) and Kim et al. (2018)), to our knowledge, no convincing theoretical arguments on why comparison tasks are so difficult for these architectures have been proposed in detail. In this paper, we will try to shed some light on this class of comparison problems by analyzing one specific task in detail. We will try to convince the reader of three important aspects regarding comparison tasks: 1) they are inherently difficult for CNNs, 2) attention drastically reduces the size of a network needed to solve such tasks, and 3) iterative processing further reduces the complexity of the task considerably. In Section 2, we will present the current research on solving comparison tasks using CNNs. The identity task, which we will concentrate our analysis on, will be presented in Section 3. The same section will also present a theoretical framework to analyze such tasks. Section 4 will use this framework to introduce the concept of the locality of a task, which we will use in Section 5 to argue that the convolutional layers are of limited usefulness for solving the identity task. We will also show that the identity task can, in the optimal case, be reduced to sorting a list of numbers in the fully connected part of a network. In Section 6, we will use research on sorting networks to propose a way to generate neural networks to sort numbers and offer a lower bound on the number of parameters such a neural network needs. Experiments in Section 7 will fortify some of the theoretical arguments with practical results. In Section 8, we will present conclusions and discuss our findings.
2. Related Work
The SVRT dataset by Fleuret et al. (2011) consists of multiple problems, each built around the classification of abstract images containing shape outlines. For some problems, it is necessary to compare the shapes to each other to be able to classify the image correctly. For example, in problem 1 of the SVRT dataset, two shapes are visible in each image. The image belongs to class 1 if both shapes are different, and to class 2 if they are identical (see Figure 1 for a few example images).

Figure 2: Examples from the PSVRT dataset by Kim et al. (2018) that offer two classification tasks with the same images: either classifying the identity of the patches or their orientation to each other.
Using this SVRT dataset, we showed in Stabinger et al. (2016) that CNNs struggle with solving tasks that require the comparison of shapes. Ricci et al. (2018) later came to the same conclusion, and Kim et al. (2018) extended this work by introducing the PSVRT dataset (see Figure 2 for examples), which uses randomly generated checkerboards instead of shape outlines. Further research by Messina et al. (2019) showed that ResNet by He et al. (2016) as well as CorNet-S by Kubilius et al. (2018) can solve some of the previously unsolved problems of the SVRT dataset, but needed 400,000 training images to do so. Funke et al. (2020) finally were able to achieve accuracies above 90% on all problems of the SVRT dataset using a ResNet architecture with just 28,000 images.

The fact that the SVRT dataset has been solved by Funke et al. (2020) probably has more to do with shortcomings of the dataset than with neural networks being able to solve the underlying comparison problem. Our experiments with a seemingly easier dataset in Section 7.3 seem to support this hypothesis. In our opinion, the SVRT dataset has two shortcomings:
First, the images for the different problems also have different complexity (e.g. the number, size, and distribution of the shapes). Therefore, it is hard to judge whether a system struggles because the actual problem is harder to solve, or because the images have higher variability than images for other problems (e.g. there are more shapes, the size of the shapes is different, the positions of the shapes are more variable). Kim et al. (2018) created the PSVRT dataset (see Figure 2) to take image complexity out of the equation by using the same images for two different tasks (spatial orientation or identity). The authors used PSVRT to confirm that tasks involving comparison are truly more difficult to learn than those involving other relations (like spatial relations in the PSVRT case).
Second, even an approximate comparison of the shapes of the SVRT dataset is sufficient to claim identity. Since the shapes are generated entirely at random, different shapes, in most cases, do not resemble each other even on a coarse level. Thus, two shapes that roughly look alike will be identical with quite a high probability. Because of this, even a small number of training images might be sufficient to have seen all approximate shapes, and the separation between training and testing set can not be ensured anymore. This means that it is hard to judge whether a system has learned a task or has just memorized the training samples. Humans, as well as more classical machine learning approaches like support vector machines with handcrafted features, do not show this systematically lower performance on tasks involving comparison, which was shown by Fleuret et al. in their original paper.
3. The Identity Task
Because of these shortcomings, we will use a simplified task in this paper that was influenced by the identity task of the PSVRT dataset by Kim et al. (2018). The task consists of images I with dimension N × N where each pixel can have one of two possible states ∈ {0, 1}. Each image contains a randomly chosen amount c ≥ 2 of non-overlapping, randomly positioned patches P of size n × n, where n ≪ N. The patches are filled with random pixels, according to a Bernoulli distribution with p = 0.5 and P_xy ∈ {0, 1}. Each image is categorized into one of two classes. An image is of class one if at least half of the patches are identical, and class two otherwise. Two patches P^i and P^j are identical iff ∀x, y ∈ {1, ..., n}: P^i_xy = P^j_xy. The goal of this task is to decide which class a given image belongs to, given M image/class pairs for training. We call this classification task and the accompanying dataset the identity task. Example images from this dataset can be seen in Figure 3.

Figure 3: Example images from the identity task, rescaled for better visibility. Since in a) the same patch appears three times, it is an example of class one. For b), no patch occurs at least three times, therefore it is an example of class two.

In the following, we will present arguments why convolutional layers are of limited usefulness for solving the identity task. As an abstraction of our actual task, let us assume that some system has to compare two image patches and decide whether they are identical or not. We assume that all patches that should be considered identical are assigned one unique symbol by a function E(p), where p is the patch to be mapped to a symbol. We will call this function the encoder. The set of all possible symbols for a task will be named S, so that ∀p: E(p) ∈ S. One symbol therefore encodes log |S| bit of information, where |S| signifies the number of symbols in S. The task of comparing two patches for identity can now be solved by a decision function D(s_1, s_2) taking in two symbols. This function detects whether s_1 and s_2 are identical or not (see Equation 1), and we will call it the decision-maker.

D(s_1, s_2) = same if s_1 = s_2, different otherwise   (1)

Deciding whether the patches p_1 and p_2 are identical therefore is reduced to the question of whether the result of D(E(p_1), E(p_2)) is "same" or "different". In our identity task, the goal is not to compare two patches for identity, but to compare c patches. A decision-maker processing all patches in one step, therefore, would need c symbols as input and therefore c log |S| bit of information from the encoders. We will now investigate the following question: Can we somehow reduce the amount of information the decision-maker needs to solve a given task by some form of hierarchical preprocessing?

We define a preprocessor P_i(s_1, s_2) as a function that takes two symbols from a set of symbols S_{i-1} as arguments and returns a single symbol from a new set of symbols S_i. A preprocessor, therefore, performs lossless compression of the received information with respect to the given task. How much a preprocessor can compress the incoming information depends on the task that has to be solved. The ratio between the amount of information going into a preprocessor and the amount of information being passed along to higher-level preprocessors will be called the compression factor, defined as

C(P_i) = (2 log |S_{i-1}|) / (log |S_i|)

This compression factor directly correlates to the amount of processing that can be performed in the preprocessor P_i. A high compression factor indicates that a lot of the processing can be done in P_i, and a compression factor of 1 indicates that no processing can occur in P_i and all the incoming information has to be passed along unchanged.
Preprocessors can be stacked in a hierarchical arrangement (see Figure 4).

Figure 4: Example of a hierarchical arrangement of encoders E, preprocessors P_1 and P_2, and a decision-maker D for the presented identity task.
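To make the task definition concrete, the following sketch (our own illustration, not code from the paper; it assumes NumPy, and the function name make_identity_image is hypothetical) generates one labeled image of the identity task as defined above.

import numpy as np

def make_identity_image(N=64, n=4, c_max=6, rng=None):
    # Generate one N x N binary image containing c non-overlapping n x n
    # patches with Bernoulli(0.5) pixels. Class 1 if at least half of the
    # patches are identical, class 2 otherwise.
    rng = rng or np.random.default_rng()
    c = int(rng.integers(2, c_max + 1))
    patches = [rng.integers(0, 2, size=(n, n)) for _ in range(c)]

    # With probability 0.5, copy one patch onto at least half of the slots,
    # so that both classes appear roughly equally often.
    if rng.random() < 0.5:
        for i in range(int(np.ceil(c / 2))):
            patches[i] = patches[0].copy()

    image = np.zeros((N, N), dtype=np.uint8)
    occupied = np.zeros((N, N), dtype=bool)
    for patch in patches:
        # Rejection-sample a position until the patch overlaps no other patch.
        while True:
            y, x = rng.integers(0, N - n, size=2)
            if not occupied[y:y + n, x:x + n].any():
                break
        image[y:y + n, x:x + n] = patch
        occupied[y:y + n, x:x + n] = True

    # Label from the actual patch contents: class 1 iff some patch pattern
    # occurs at least c/2 times.
    counts = {}
    for patch in patches:
        key = patch.tobytes()
        counts[key] = counts.get(key, 0) + 1
    label = 1 if max(counts.values()) >= c / 2 else 2
    return image, label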
4. Locality
Given a preprocessor P_i, we define the receptive field size rfs(P_i) as the number of inputs its output, directly or indirectly, depends on. E.g. for the hierarchical structure from Figure 4, the first layer of preprocessors (P_1) will have a receptive field size of 2, the next layer will have a receptive field size of 4, and so forth (i.e. rfs(P_i) = 2^i).

We will now investigate how the minimum amount of information that is needed to fully represent the content of a specific receptive field changes with respect to its size. This minimum amount of information is equivalent to the number of output symbols a preprocessor in our hierarchical processing scheme needs. If the minimal amount of information needed to represent a receptive field does not depend on the receptive field size, we have perfect locality (i.e. all the processing can be done on a local scope). If the amount of information grows directly proportional to the receptive field size, we have no locality at all (i.e. all processing has to be done on a global scope by the decision-maker). Therefore, we define the locality L of a task as the limit of the compression factor of our preprocessors while going up the hierarchy and therefore simultaneously increasing the receptive field size of the preprocessors (we assume that the receptive field size and the depth of the hierarchy are not limited):

L = lim_{i→∞} C(P_i) − 1

Informally, taking the identity task as an example, this can be seen as having an infinite number of patches to analyze. Therefore, the hierarchy of Figure 4 would have to have infinite depth, and the locality is the compression factor the preprocessors approach while going up this infinite hierarchy, minus one. A locality of 0 indicates that no local processing can be performed if the receptive field becomes very big (i.e. the preprocessor is not able to compress anything). A locality L < 0 can not occur since it is always possible to pass along the full incoming information. A locality of 1 means that all processing can be done on a local level, i.e. the preprocessor can compress two symbols from S_{i-1} to one symbol from S_i, and the amount of information needed to represent one symbol from S_{i-1} and from S_i is the same (|S_{i-1}| = |S_i|). In other words, the amount of information needed to represent a receptive field does not depend on its size. A locality L between 0 and 1 indicates that a varying degree of processing can be performed on a local level.

Hypothesis:
Problems exhibiting low locality are ill-fitted to be solved by Convolutional Neural Networks (CNNs).

Intuitively, CNNs are ill-suited to solve non-local tasks, since the convolutional part of the network is local by design.
As previously described, we assume that c patches of size n × n, consisting of black and white pixels, are given, and the task is to decide whether at least half of the patches are identical or not. Since two patches are only considered equal if they contain the same pattern, the encoder needs to use 2^(n²) symbols in S_0 to preserve comparability. Thus, one such symbol encodes n² bit of information. The preprocessors of the first layer P_1 each get 2n² bit of information as input, assuming that each possible pattern can occur with the same probability. No matter whether the two incoming symbols are the same or different, the output has to encode all possible combinations of incoming symbols, since no decision about which patch has to be available for comparison at a later stage can be made at this point. Since the order of symbols is not relevant (i.e. P_1(s_1, s_2) = P_1(s_2, s_1)), the preprocessor has to use |S_0|² − (|S_0| choose 2) symbols for S_1. Intuitively, one would assume that at least some information can be left out in cases where a preprocessor receives the same symbol twice (meaning that the two patches are identical), but this is not the case. The information that the same patch was received twice needs one symbol as a representation, the same way that any other combination of received patches needs exactly one symbol to represent. This gives us the following compression factor for P_1:

C(P_1) = (2 log |S_0|) / log(|S_0|² − (|S_0| choose 2))   (2)

The preprocessors of the next layer P_2 again have to use one symbol in S_2 for all possible pairs of symbols from S_1, ignoring the order of symbols. This procedure does not change for any of the following layers of preprocessors, and we can define a general formula for the compression factor of any preprocessor as

C(P_i) = (2 log |S_{i-1}|) / log(|S_{i-1}|² − (|S_{i-1}| choose 2))   (3)

For better readability we define s = |S_{i-1}|. Since the number of symbols in S_i monotonically increases while going up the hierarchy, we can rewrite the locality of this task as follows:

L = lim_{s→∞} (2 log s) / log(s² − (s choose 2)) − 1 = lim_{s→∞} (2 log s) / log((s² + s)/2) − 1   (4)

Applying L'Hôpital's rule,

L = lim_{s→∞} [d/ds (2 log s)] / [d/ds log((s² + s)/2)] − 1 = lim_{s→∞} (2s + 2)/(2s + 1) − 1 = 0   (5)

With a locality of 0, we can see that processing can only happen once the system has reached a global receptive field.
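As a quick numerical check of Equations 4 and 5 (our own sketch, plain Python), the compression factor indeed approaches 1 as the symbol sets grow, which corresponds to a locality of 0:

import math

def compression_factor(s):
    # C(P_i) for an incoming symbol set of size s: 2*log(s) divided by the log
    # of the number of unordered symbol pairs, s^2 - (s choose 2) = (s^2 + s)/2.
    return 2 * math.log(s) / math.log((s * s + s) / 2)

for bits in (16, 32, 64, 128):
    s = 2 ** bits
    print(f"s = 2^{bits}: C(P_i) = {compression_factor(s):.4f}")
# The printed values decrease towards 1, so L = lim C(P_i) - 1 = 0.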
5. Application to the Identity Task
We will now see how the theoretical architecture of encoder, preprocessor, and decision-maker can be translated to the different parts of a Convolutional Neural Network (CNN). Commonly, a CNN for classification is constructed from two main parts: a varying number of convolutional and pooling layers extracting feature descriptors with increasingly global scope, and a part of fully connected layers generally considered to use those features for the actual classification. The symbols the encoder is extracting can be interpreted as the activations of the neurons of some of the first layers in the CNN, where the receptive fields of the neurons are still relatively small. The following convolutional and pooling layers can be interpreted as the preprocessors that compress and extract information as the receptive field becomes more extensive, either through pooling or through the stride of the convolutions. The following fully connected layers have a global receptive field and can be seen as the decision-maker, taking the extracted information from the convolutional layers and calculating the class probabilities.

As we have shown in Section 4, the only operation that can be performed on a local scope is extracting the information contained in the patches and forwarding this information to the decision-maker. Therefore, the best the convolution and pooling part of the network can do is to find an efficient, lossless encoding of the information contained in each patch. If we allow for arbitrary precision real numbers, a convolution can in principle completely encode the content of a patch in a single real-valued output. Those real-valued outputs can be interpreted as the symbols forwarded to the decision-maker. A kernel of size n × n that is capable of capturing all information of a patch with binary pixels (i.e. black or white) in a single real number can be constructed by setting the n² weights of the kernel to w_i = 2^(−i) for i ∈ {1, ..., n²}. This ensures that each bit of the incoming patch is represented by exactly one bit of the fraction bits of the resulting floating-point number. See Figure 5 for an example of a kernel constructed in this manner.

Figure 5: Example of a kernel that is able to fully encode an incoming binary patch in its output.

There are two possibilities when extracting the patches in the convolutional layers. If some attentional mechanism allows the convolutional layers to extract exactly one number for each patch, we end up with c (the number of patches) numbers that need to be compared. Since we have to design our network in a way so that the worst case can also be handled, we have to assume the maximal number of patches that are possible. This means that we have to allow for (N/n)² numbers to be compared, where N is the width and height of the image in pixels and n is the width and height of the patch in pixels. If no attentional mechanism is present, the convolutional layers will have to generate one number for each possible patch position, resulting in (N − n)² numbers that have to be processed. In reality, the situation is even worse since the precision of floating-point numbers is limited. Using a 32-bit floating-point number (the most common size used for neural networks), only 24 bit can be reliably encoded with the presented method.
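A minimal NumPy sketch of such an encoding kernel (our own illustration; for exactness we use 64-bit floats here, which can hold up to 52 pixel bits, while the 32-bit floats discussed above hold 24):

import numpy as np

def encoding_kernel(n):
    # n x n kernel with weights w_i = 2^{-i}, so each binary pixel of a patch
    # lands in exactly one fraction bit of the real-valued output.
    i = np.arange(1, n * n + 1, dtype=np.float64)
    return (2.0 ** -i).reshape(n, n)

n = 4
kernel = encoding_kernel(n)
encode = lambda patch: float(np.sum(kernel * patch))

a = np.random.randint(0, 2, size=(n, n))
b = a.copy()                 # identical patch
c = a.copy(); c[0, 0] ^= 1   # differs in a single pixel
print(encode(a) == encode(b))  # True:  identical patches -> identical symbols
print(encode(a) == encode(c))  # False: different patches -> different symbols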
Because of this limit, patches with more than 24 pixels would have to be encoded by multiple numbers. After the convolution and pooling part of the network, in the most favourable case, we would have several inputs to our fully connected layers, where each input encodes the information of one patch (in the case with attention) or one possible patch position (in the case without attention). The most efficient way of detecting the number of identical numbers in a list is to sort the list first. Once the list is sorted, identical numbers can be identified by merely checking neighbouring numbers for equality. In the experimental section, we will show that this also holds in practice using neural networks. The question now is how a neural network can sort numbers and how big it has to be to do so. The next section will start with sorting networks to estimate a lower bound on the number of parameters needed to achieve sorting using a neural network, under the assumption that it is constructed from fully connected layers.
6. Sorting Networks
A sorting network by O'connor & Nelson (1962) consists of wires and comparators. Wires "transport" comparable values (e.g. real numbers in our case). Pairs of wires can be connected by comparators that swap the values transported on the wires if they are not already in the correct order. Multiple comparators can swap values in parallel as long as each wire is only connected to one comparator. We will call such a parallel evaluation of comparators a layer. A sorting network is a fixed arrangement of comparators and wires so that any combination of possible values sent along the wires is sorted after passing all comparators. Figure 6 shows a sorting network in operation. Sorting networks take the task of sorting - which is usually perceived as an iterative process - and convert it into a highly parallel, purely feed-forward problem.

Figure 6: Example of a sorting network that is able to sort four real numbers in descending order. The horizontal lines represent the wires, the vertical lines and black dots represent the comparators. Values flow along the wires from left to right. Swapped values are highlighted and the three layers of the network are labeled L_1 through L_3.

The number of layers of a sorting network is also called its depth. Extensive research exists on the theoretical properties of such sorting networks, and lower bounds on their depth have been published. There exists an information-theoretical lower bound of log n on the depth of a sorting network to sort n numbers, and Kahale et al. (1995) were able to tighten that lower bound to (c − o(1)) log n for a constant c greater than 1.

Since we want to transfer the knowledge on how to construct a sorting network to neural networks, we first have to find a way to implement a comparator using a neural network. We assume the neural network has two inputs x_1 and x_2 in the range [0, 1] and two outputs y_1 and y_2. We expect the following behavior: y_1 = x_1 and y_2 = x_2 in case x_1 ≥ x_2, and y_1 = x_2, y_2 = x_1 in case of x_1 < x_2. Figure 7 shows a minimal implementation of such a comparator neural network. The implementation assumes max(0, x) (a rectified linear unit) as the activation function of the neurons, and biases are not needed. The following equations describe the hidden neurons:

z_1 = max(0, x_1) = x_1 (since x_1 ≥ 0)
z_2 = max(0, x_2) = x_2 (since x_2 ≥ 0)
z_3 = max(0, x_2 − x_1)

Figure 7: One possible, minimal implementation of a comparator as a neural network.
So if x_1 ≥ x_2, then z_3 = 0, otherwise z_3 = x_2 − x_1. This means we can calculate y_1 and y_2 by

y_1 = z_1 + z_3 = x_1 + max(0, x_2 − x_1)
y_2 = z_2 − z_3 = x_2 − max(0, x_2 − x_1)

In case of x_1 ≥ x_2:

y_1 = x_1 + 0 = x_1
y_2 = x_2 − 0 = x_2

In case of x_1 < x_2:

y_1 = x_1 + x_2 − x_1 = x_2
y_2 = x_2 − x_2 + x_1 = x_1

which is the expected behavior of a comparator.
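Written out as a small NumPy sketch (our own; the weight layout follows the equations above rather than the exact figure), the comparator has 2·3 + 3·2 = 12 weights and no biases:

import numpy as np

# Hidden layer: z1 = x1, z2 = x2, z3 = relu(x2 - x1). Output layer:
# y1 = z1 + z3 = max(x1, x2), y2 = z2 - z3 = min(x1, x2).
W_hidden = np.array([[1.0, 0.0, -1.0],
                     [0.0, 1.0,  1.0]])   # shape: 2 inputs x 3 hidden units
W_out    = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, -1.0]])        # shape: 3 hidden units x 2 outputs

def comparator(x1, x2):
    z = np.maximum(0.0, np.array([x1, x2]) @ W_hidden)   # ReLU hidden layer
    return z @ W_out                                     # linear output layer

print(comparator(0.2, 0.9))   # [0.9, 0.2]
print(comparator(0.7, 0.3))   # [0.7, 0.3]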
A sorting network can trivially be used to create a neural network for sorting numbers. For each layer of the sorting network, two hidden layers in the neural network of appropriate width are created. The wires are implemented by passing values through neurons with a weight of 1 (i.e. the values are unchanged). A comparator in the sorting network is replaced by a comparator neural network, after which the two numbers passed through will be sorted. Figure 8 shows a minimal sorting network to sort three numbers and Figure 9 the corresponding neural network.

Figure 8: Example of an optimal sorting network for three numbers.

Figure 9: Example of a neural network with 72 parameters, capable of sorting three numbers, constructed using the sorting network from Figure 8. Weights with a value of 0 are omitted for clarity.

As we have shown above, one comparator, implemented using a neural network, needs three layers. However, since the input and output layer of comparators following each other can be combined, we need two fully connected layers in our neural network for each layer in the sorting network. The first layer needs ⌈1.5x⌉ neurons and the second needs x neurons to implement one layer of a sorting network which sorts x numbers. The number of parameters p needed for such a neural network, which implements a sorting network with depth d, sorting x numbers, is:

p = 2d ⌈1.5x⌉ x   (6)

Slightly fewer parameters can be achieved, depending on the exact structure of the sorting network. More specifically, if a layer of the sorting network passes through values without processing (i.e. in cases where a line is not connected to any comparator), some neurons can be saved. See Figure 9 as an example: one neuron can be saved for each second neural network layer because every layer of the sorting network passes through one value unchanged. If we assume that the neural sorting network has to be learned from data, we can not assume an architecture that is so closely fitted to the problem, and the parameter count of Equation 6 is more realistic than the version with neurons removed from individual layers.

Experimental evidence suggests that neural networks constructed in this manner are minimal solutions to the sorting problem when using fully connected layers with rectified linear units as activation functions, and the number of parameters from Equation 6 is close to the minimum. The smallest network we could train to sort three numbers needs 97 parameters, which is higher than our proposed lower bound from Equation 6 of 90 parameters.
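Equation 6 as a one-line helper (our own sketch), reproducing the 90-parameter bound quoted above:

import math

def sorting_network_parameters(x, d):
    # Parameters of a fully connected ReLU network implementing a depth-d
    # sorting network on x values: per sorting-network layer, one hidden layer
    # with ceil(1.5 * x) neurons followed by one layer with x neurons.
    return 2 * d * math.ceil(1.5 * x) * x

print(sorting_network_parameters(3, 3))     # 90
print(sorting_network_parameters(784, 10))  # 18,439,680; the with-attention estimate used below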
Using a typical input image size of 224 × 224 pixels, the amount of numbers to process depends, as previously mentioned, on whether there is some form of attentional mechanism or not. With attention, the system will be able to detect which part of the image is a patch and which one is not. Therefore, we have to be able to process the maximal amount of patches that could fit into an image without overlapping. Assuming a patch size of 8 × 8, we need to process a maximum of 784 patches, and therefore numbers. Without attention, the system has to be able to process all 46,656 possible patch positions. Assuming the smaller, information-theoretic lower bound on the number of layers needed in a sorting network, we need ten layers and 16 layers with and without attention respectively. Following Equation 6, the neural sorting networks need ≈ 18.4 million parameters with and ≈ 104.5 billion parameters without an attention mechanism. This difference in parameters shows how important attention is in such cases to prevent a combinatorial explosion. This need for attention, to bring the complexity of vision tasks down to practical levels, was already shown by Tsotsos (1988).

We have previously mentioned that we suspect that iterative processing should, similarly to attention, considerably reduce the complexity of the identity task. This advantage of iterative processing is connected to the problem that purely feed-forward networks have with sorting numbers. If we are allowed to send a list of numbers through a sorting network and apply the same sorting network to the resulting list of numbers repeatedly, we can sort any list of numbers with a network with only two layers. Such a network can be implemented by connecting every second pair of wires with a comparator, starting with the first wire in layer one, and starting with the second wire for layer two (see Figure 10 for such a sorting network for six numbers).

Figure 10: A sorting network that is able to sort a list of six numbers by repeated application.

Sending numbers through this sorting network ensures that each number will move at least one step in the direction of its correctly sorted position. Therefore, a list of x numbers will be completely sorted after applying this sorting network x times. We can convert this sorting network into a neural network, using the procedure from Section 6.1. Using Equation 6 with a depth of two, a neural network that can be used to sort x numbers using iterative (recurrent) processing needs 4⌈1.5x⌉x parameters. This reduces the network size for our identity task to ≈ 3.7 million parameters with and ≈ 13.1 billion parameters without an attention mechanism.

Suppose we also allow for iterative processing of the list itself. In that case, the sorting problem can be solved by iteratively applying the same comparator on each pair of numbers of the list. This approach is in effect an implementation of bubble sort and can be solved with a single neural network with only 12 parameters. This reduction in parameter counts does not only reduce the amount of resources such neural networks need, it also drastically reduces the amount of training data needed and leads to better generalizability of the networks. In Section 7.1, we will show experimentally how useful iterative (recurrent) processing is for our proposed problem.
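A sketch of this iterative scheme (our own, in plain Python): repeatedly applying the fixed two-layer pattern of Figure 10, built from the 12-parameter comparator, sorts any list after as many applications as there are values.

def comparator_pair(a, b):
    # The ReLU comparator written arithmetically: returns (max, min).
    swap = max(0.0, b - a)
    return a + swap, b - swap

def iterative_sort(values):
    # Odd-even transposition: layer one pairs wires (0,1), (2,3), ...;
    # layer two pairs wires (1,2), (3,4), ... Applying both layers
    # len(values) times sorts the list in descending order.
    values = list(values)
    n = len(values)
    for _ in range(n):
        for start in (0, 1):
            for i in range(start, n - 1, 2):
                values[i], values[i + 1] = comparator_pair(values[i], values[i + 1])
    return values

print(iterative_sort([0.3, 0.9, 0.1, 0.5, 0.7, 0.2]))
# [0.9, 0.7, 0.5, 0.3, 0.2, 0.1]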
7. Experimental Evaluation
In the following sections, we will substantiate our theoretical findings with experimental results. We will show that detecting identical numbers in a list of numbers (which is, as we showed, the most efficient input the convolutional part of a CNN can give to the fully connected part for the identity task) becomes much easier once the list of numbers is sorted. We will also show that recurrent networks perform much better at this task. In addition, we will highlight that an attentional mechanism makes the identity task much easier to solve.
As we have shown in Section 5, the identity task boils down to the decision whether at least half of the numbers of a list of numbers are identical. To show that sorting this list decreases the difficulty of the problem in practice, we ran experiments where we used fully connected neural networks to solve this classification problem. The list of numbers was either sorted or unsorted before being given to the neural network for classification. Different architectures of up to ten layers with 100 neurons in each layer were tested for lists of ten numbers. The dataset consisted of 4,000 randomly generated lists for training and 1,000 for testing. Binary cross entropy was used as a loss, the networks were optimized using Ranger by Wright (2019), and ReLU by Hahnloser et al. (2000) was used as an activation function for all but the last layer (which used a logistic sigmoid function). Weights were initialized following the method proposed by He et al. (2015). Since we were not able to get an accuracy above 0.8 with the unsorted list, we also tried to solve this task using 50,000 training samples.

Table 1: Accuracy of different fully connected neural network architectures, when trying to classify whether a list of ten numbers contains the same number at least five times. The list is sorted or unsorted and either a small or large training set is provided. Many more combinations were tested and only the more interesting combinations are shown.

Input     Training set  Hidden Layers  Layer size  Parameters  Test Accuracy  Train Accuracy
sorted    small         1              10          132         0.99           0.99
sorted    small         1              100         1302        1.0            1.0
unsorted  small         1              500         6502        0.77           0.98
unsorted  small         1              1000        13002       0.80           1.0
unsorted  small         2              100         11402       0.79           1.0
unsorted  small         2              500         257002      0.80           1.0
unsorted  small         2              1000        1014002     0.79           1.0
unsorted  small         10             100         92202       0.75           1.0
unsorted  large         10             5           337         0.76           0.77
unsorted  large         10             10          1122        0.84           0.85
unsorted  large         10             50          23602       0.91           0.94

Looking at the results in Table 1, it is easy to see that the task is much easier once the list is sorted. Even with one hidden layer and only ten neurons, we were able to achieve almost perfect accuracy for sorted lists. On the other hand, for the unsorted list, only a network with ten layers and 50 neurons for each layer and a training set of 50,000 samples was able to achieve an accuracy above 0.9. These results show that once the list of numbers is sorted, the task itself becomes very easy.

As previously mentioned, we hypothesize that iterative processing would solve many of the shortcomings of neural networks for these kinds of problems. To test this hypothesis, we also tested the previous dataset with a Long Short-Term Memory (LSTM) architecture by Hochreiter & Schmidhuber (1997), which contains recurrent connections and can iteratively process data because of this. The results of this architecture for unsorted lists can be seen in Table 2.

Table 2: Accuracy of different LSTM neural network architectures, when trying to classify whether a list of ten numbers contains the same number at least five times. The lists were unsorted.

Training Samples  Layers  Hidden State Size  Parameters  Train Accuracy  Test Accuracy
4000              1       5                  152         0.84            0.81
4000              1       10                 502         0.91            0.88
4000              1       15                 1052        0.92            0.88
4000              2       5                  374         0.87            0.87
4000              2       10                 1342        0.91            0.89
4000              2       15                 2912        0.95            0.92
40000             1       5                  152         0.86            0.86
40000             1       10                 502         0.93            0.93
40000             1       15                 1052        0.95            0.95
40000             2       5                  374         0.87            0.87
40000             2       10                 1342        0.95            0.95
40000             2       15                 2912        0.96            0.95

It is evident that a recurrent architecture has two advantages when solving this kind of problem: first, the networks need far fewer parameters and second, they can solve the problem using less training data because they generalize much better, which is a consequence of them being able to solve the problem with fewer parameters.
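For reference, a sketch of how such training lists can be generated (our own; the exact generation procedure is not spelled out in the text, so the class balancing is an assumption):

import numpy as np

def make_list_sample(length=10, sort_input=False, rng=None):
    # A list of `length` random numbers; label 1 if some value occurs at
    # least length/2 times (here: at least five times), label 0 otherwise.
    rng = rng or np.random.default_rng()
    values = rng.random(length)
    if rng.random() < 0.5:                    # force the "identity" class half the time
        values[: length // 2] = rng.random()  # at least half of the values identical
        rng.shuffle(values)
    _, counts = np.unique(values, return_counts=True)
    label = int(counts.max() >= length / 2)
    if sort_input:
        values = np.sort(values)
    return values, label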
In Section 6.1, we have shown how sorting networks can be used to construct neural networks that can sort numbers. Since we expect our networks to learn the sorting operation as part of a more extensive network, we will experimentally determine how many parameters neural networks need to be able to learn the sorting operation from data alone. As training data, the networks were provided with an unlimited supply of unsorted vectors containing numbers between 0 and 1 as input and the correctly sorted vectors as training targets. The networks consisted of a variable amount of fully connected layers, all containing the same amount of neurons (except for the input and output layers). The networks were again optimized using Ranger by Wright (2019), using ReLU by Hahnloser et al. (2000) as the activation function. Weights were initialized following the method by He et al. (2015), and we used mean squared error as the loss function.

The number of layers and neurons per layer were systematically tested multiple times. We consider a network to have learned the sorting task if it reaches a loss below a fixed small threshold. We report the networks with the lowest number of parameters. Unfortunately, we were not able to meaningfully search for networks that sort more than five numbers. Sorting a list of numbers is in itself a surprisingly tricky problem to learn for networks, and finding a solution is very sensitive to the weight initialization. Even with the reported smallest networks, we had to test the same architecture more than 50 times to be able to teach the network to sort, and this problem became exponentially more difficult with each new number. Thus, we were not able to perform a meaningful search for the smallest networks sorting six numbers and above.

Table 3 shows the smallest networks we could find through training. The configurations in the table show the number of neurons per layer. As can be seen, the number of parameters of trained networks is strictly greater than the number of parameters needed by networks that are constructed by the method presented in Section 6.1, and the difference grows rapidly with the size of the vector to be sorted.

Table 3: Smallest neural networks for sorting that could be found by training, in comparison to the smallest networks that can be constructed using the method from Section 6.1.
Numbers to Sort  Learned Layer Configuration            Parameters Learned  Parameters Constructed
3                3 7 6 3                                97                  72
4                4 7 7 7 7 4                            235                 144
5                5 11 11 11 11 11 11 11 11 11 11 11 5   1446                340

To show that attention is useful for solving the identity task, we trained a ResNet-18 architecture by He et al. (2016), which was already pre-trained on the ImageNet dataset, for two different variants of the task. In both cases, all images contain three patches of a fixed size. The images were generated at a lower resolution and scaled up afterwards to the typical input resolution for ResNet of 224 × 224. In addition, the background of the image is kept grey, to make it easier for the system to detect which part belongs to a patch and which does not. The images can be classified into the two classes "identity" if two of the three patches are identical and "non-identity" if they are not. For the first variant, the three patches are presented as a regular image to the neural network. The three patches are presented to the network in three separate channels in the second variant. See Figure 11 for example images of those two variants.

Figure 11: Example images for the two tested versions of the identity task: (a) without attention (one channel), (b) with attention (three channels). In a) all patches are combined into one channel. In b) the patches are presented to the system in different channels, which simulates pre-attention. In this case, a) is an example of the non-identity class and b) of the identity class. The varying darkness of the images stems from the pre-processing for ResNet.

The separation for the second variant, in essence, pre-attends the input, an observation that Kim et al. (2018) already made. The authors were also able to show that separating objects into separate channels vastly improves performance on the SVRT dataset by Fleuret et al. (2011) and makes the tasks very easy to solve for CNNs. For our experiments, the amount of training data that was available to the system was not limited, and no care was taken to separate training from testing data. Although the task looks easy on the surface, the neural network seems to solve it mainly by memorization, since even restricting the training data to 50,000 images leads to overfitting on the training data from the start and an
accuracy on the validation set of about 0.6 (presumably because of some overlap of the training and validation set). ResNet-18 was trained on both variants until an accuracy of 0.98 was achieved on the validation set. Graphs of the losses and validation accuracies can be seen in Figures 13 and 12.

Figure 12: Loss and accuracy during training of a ResNet-18, trained on data with pre-attention.

Figure 13: Loss and accuracy during training of a ResNet-18, trained on data without pre-attention.

Comparing the results with and without pre-attention shows that the task becomes much easier when pre-attention is used. With attention, around 500,000 images are needed for an accuracy above 0.98, while the network without attention needs around 3,000,000 images, which amounts to six times more training data and training time.
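To illustrate the two input variants, a sketch of how they could be constructed (our own; the random patch positions and the centering of the per-channel patches are assumptions, not taken from the paper):

import numpy as np

def make_variant_inputs(patches, N=224, rng=None):
    # patches: three n x n binary patches. Returns (a) a single-channel image
    # with the patches at random non-overlapping positions on a grey background,
    # and (b) a three-channel input with one patch per channel (pre-attention).
    rng = rng or np.random.default_rng()
    n = patches[0].shape[0]

    single = np.full((1, N, N), 0.5, dtype=np.float32)
    occupied = np.zeros((N, N), dtype=bool)
    for p in patches:
        while True:
            y, x = rng.integers(0, N - n, size=2)
            if not occupied[y:y + n, x:x + n].any():
                break
        single[0, y:y + n, x:x + n] = p
        occupied[y:y + n, x:x + n] = True

    separated = np.full((3, N, N), 0.5, dtype=np.float32)
    off = (N - n) // 2
    for ch, p in enumerate(patches):
        separated[ch, off:off + n, off:off + n] = p
    return single, separated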
8. Conclusions and Discussion
As we have shown, the reason that CNNs perform poorly on the identitytask is threefold:
First, the convolutional layers cannot perform much meaningful work, since the identity task is inherently global, but convolutions are inherently local. This mismatch leaves most of the work for the fully connected part of the network, which operates on a global level.
Second, since CNNs usually do not provide an effective attention mechanism, the number of features forwarded to the fully connected layer is unnecessarily high. At the level of the fully connected layers, the identity task can mostly be reduced to a sorting problem. We used sorting networks to propose a lower bound on the number of parameters a neural network needs to solve this task.
Third, since most of the processing has to be done in fully connected layers which do not share weights, without recurrent connections or other forms of iterative processing, the resulting system also uses data very inefficiently. This inefficiency can be seen in the fact that, even when only looking at the fully connected part of the problem, solving the identity task for ten numbers required 50,000 training samples. All these arguments do not necessarily mean that smaller networks can not solve the identity task, but they likely will not generalize well and will only solve the problem through memorization.

In our opinion, two mechanisms are needed to solve comparison tasks efficiently (regarding the number of parameters of the networks as well as the amount of training data needed). First, an attention mechanism is needed to reduce the number of entities that need to be compared, to prevent a combinatorial explosion. Second, the information extracted from these entities has to be processed iteratively using a recurrent architecture to reduce the number of parameters, as well as the amount of training data needed, through weight sharing.
References
Fleuret, F., Li, T., Dubout, C., Wampler, E. K., Yantis, S., & Geman, D. (2011). Comparing machines and humans on a visual categorization test. Proceedings of the National Academy of Sciences, 17621–17625.

Funke, C. M., Borowski, J., Stosio, K., Brendel, W., Wallis, T. S., & Bethge, M. (2020). The notorious difficulty of comparing human and machine perception. arXiv preprint arXiv:2004.09406.

Hahnloser, R. H., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 947–951.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026–1034).

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 1735–1780.

Kahale, N., Leighton, T., Ma, Y., Plaxton, C. G., Suel, T., & Szemerédi, E. (1995). Lower bounds for sorting networks. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing (pp. 437–446). ACM.

Kim, J., Ricci, M., & Serre, T. (2018). Not-so-CLEVR: learning same-different relations strains feedforward neural networks. Interface Focus, 20180011.

Kubilius, J., Schrimpf, M., Nayebi, A., Bear, D., Yamins, D. L., & DiCarlo, J. J. (2018). CORnet: modeling the neural mechanisms of core object recognition. BioRxiv, 408385.

Messina, N., Amato, G., Carrara, F., Falchi, F., & Gennaro, C. (2019). Testing deep neural networks on the same-different task. (pp. 1–6). IEEE.

O'connor, D., & Nelson, R. (1962). Sorting system with nu-line sorting switch. US Patent 3,029,413. URL: https://patents.google.com/patent/US3029413.

Ricci, M., Kim, J., & Serre, T. (2018). Same-different problems strain convolutional neural networks. arXiv preprint arXiv:1802.03390.

Stabinger, S., Rodríguez-Sánchez, A., & Piater, J. (2016). 25 years of CNNs: Can we compare to human abstraction capabilities? arXiv preprint arXiv:1607.08366.

Tsotsos, J. K. (1988). A 'complexity level' analysis of immediate vision. International Journal of Computer Vision, 303–320.

Wright, L. (2019). New deep learning optimizer, Ranger: Synergistic combination of RAdam + Lookahead for the best of both. https://bit.ly/2CS4OVJ.