A case for robust translation tolerance in humans and CNNs. A commentary on Han et al.
Ryan Blything, Valerio Biscione, and Jeffrey Bowers
University of Bristol, School of Psychological Science, Bristol, BS8 1TU, United Kingdom
* [email protected]

Han et al. (2020) reported a behavioral experiment that assessed the extent to which the human visual system can identify novel images at unseen retinal locations (what the authors call "intrinsic translation invariance") and developed a novel convolutional neural network model (an Eccentricity Dependent Network, or ENN) to capture key aspects of the behavioral results. Here we show that their analysis of behavioral data used inappropriate baseline conditions, leading them to underestimate intrinsic translation invariance. When the data are correctly interpreted they show near complete translation tolerance extending to 14° in some conditions, consistent with earlier work (Bowers et al., 2016) and more recent work (Blything et al., in press). We describe a simpler model that provides a better account of translation invariance.

Han et al. examined intrinsic translation invariance in humans by requiring non-Korean participants to classify Korean letters as same or different when a 'target' letter was flashed for 33 ms, followed two seconds later by a 'test' letter flashed for 33 ms. Target and test letters were shown at fixation or at varying eccentricities, resulting in three translation conditions they labelled Central (central target, peripheral test), Peripheral (peripheral target, central test), and Opposite (peripheral target, test at the opposite peripheral side). They also assessed the impact of image size (scale) on translation invariance, with letters subtending 30′, 1°, or 2°. In all cases, performance in these translation conditions was compared to a condition in which the target and test letters were both presented at fixation. As illustrated in Figure 1 (a reproduction of their Figure 3), they observed substantially reduced performance in the Peripheral and Opposite conditions at all scales, and reduced performance in the Central condition for the smaller stimuli. The authors took these findings to highlight substantial limitations of intrinsic translation invariance, especially for smaller stimuli. However, this conclusion is mistaken. The reduced performance across the translation conditions was largely the product of poor visual acuity in peripheral vision.

Figure 1: Taken from Han et al. (2020).
Windows of invariance for different conditions. Same/different accuracy is shown on a color scale. The central window (top) depicts the results when target letters were presented at fixation and corresponding test letters in the periphery at various scales and eccentricities. The peripheral window (bottom left) depicts the results when target letters were presented at various scales and eccentricities and corresponding test letters at fixation. The opposite window (bottom right) depicts the results when target and test letters were presented at various scales and eccentricities, with both stimuli at the same distance from the center but on opposite sides of the visual field (or both presented at fixation at 0 degrees eccentricity). In all plots, the tested conditions are marked with circles and the other data points are estimated using natural neighbor interpolation.

The standard procedure for assessing translation invariance while controlling for visual acuity is to compare performance at a fixed eccentricity when target/test stimuli are presented at the same retinal location against performance when they are presented in opposite visual fields (e.g., Afraz & Cavanagh, 2008; Biederman & Cooper, 1991; Cox & DiCarlo, 2008; Dill & Edelman, 2001; Dill & Fahle, 1997; Dill & Fahle, 1998). In this way the target/test items have equal acuity, with a translation of 0° in the former condition and a translation of twice the eccentricity in the latter condition. However, this is not what is plotted in Figure 1. Rather, performance at 0° reflects performance when target and test stimuli are both presented at fixation, where acuity is maximal, explaining the extremely high performance in the Central, Peripheral, and Opposite conditions at 0° for all image scales. In all other translation conditions, the target letter, the test letter, or both were presented in peripheral vision, leading to a confound of translation with acuity.

Han et al. were aware of a possible confound between translation tolerance and visual acuity, but rejected acuity as the explanation for their findings on the grounds that Korean participants (who were familiar with the letters) were highly accurate in most translation conditions. We disagree with this logic. Korean participants have learned to recognise degraded Korean letters in peripheral vision (e.g., when fixating on the central letter of a word, the outer letters are projected to peripheral vision), and their success does not rule out acuity as the cause of the difficulty for non-Korean participants.

As noted above, the critical data for assessing translation tolerance while controlling for visual acuity come from comparing performance at a fixed eccentricity when target/test stimuli are presented in the same or opposite visual fields. These data are reported in Han et al.'s Figure S1 (Supplementary Material), and are replotted here as Figure 2. In all cases performance was similar when the letters were flashed in the same (solid blue line) and opposite (dashed yellow line) fields at a given eccentricity, with performance for the smaller stimuli reduced as a function of eccentricity. This shows that intrinsic translation invariance was near complete, with performance on the task largely limited by visual acuity.
Figure 2: Han et al. behavioral data. Comparison of the same-location (D → D) condition in blue vs. the opposite-location (D → Opp) condition in yellow, replotted from their supplementary data.

The near complete translation invariance extending to 14 degrees for the largest Korean letters is consistent with Bowers et al. (2016), who observed robust translation tolerance extending 13 degrees for images of unfamiliar 2D shapes that subtended 5 degrees, and Blything et al. (in press), who reported robust translation tolerance extending 18 degrees for images of unfamiliar 3D shapes that subtended 5 degrees.

This reanalysis of the Han et al. (2020) dataset undermines their Eccentricity Dependent Network (ENN), which was motivated to explain limited intrinsic translation invariance. Instead, the robust translation tolerance is more consistent with the standard CNN models that Han et al. (2020) rejected, although those models still fail to capture the overall drop in acuity at peripheral locations. With this in mind, we trained a classic CNN (VGG16; Simonyan & Zisserman, 2014) on a dataset with injected pepper noise, probabilistically applied as a hyperbolic function of the distance to the center of the canvas. The noise was designed to mimic the reduced acuity of peripheral vision (Strasburger et al., 2011) (Fig. 3a). Similarly to Han et al., we pretrained the network on a (noisy) MNIST handwritten-digit dataset at different locations and scales (pretraining on translated objects is essential to obtain translation invariance on novel items, as shown in Biscione & Bowers, 2020). We then re-trained this network on the set of Korean characters used by Han et al., divided into two groups of 15 characters each (same characters and grouping as in Figure 1 of their work, but with colours inverted for consistency with the MNIST dataset). The network's objective was to classify the characters from either of these groups. We used four different letter sizes (10, 16, 22, and 28 pixels) on a 224x224-pixel canvas. Importantly, we trained the network by displaying the items at only one location, and tested the network at the same (D → D) or opposite (D → Opp) location (Fig. 3b-f). Notice that in the D → Opp condition, the network is queried at a location at which it has never seen any Korean characters. By comparing D → D to D → Opp, we can infer the performance drop due solely to object displacement (translation tolerance). The results in Figure 4 show that intrinsic translation invariance is near perfect and bounded by visual acuity in the periphery. The model qualitatively mimics the relationship between object size, translation, and accuracy found in humans.
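The noise and stimulus-generation procedure can be illustrated with a short sketch. This is not the code used in our experiment: the exact noise constants (the maximum pepper probability and the half-saturation distance) and the placeholder glyph below are assumptions made purely for illustration; only the 224x224 canvas, the letter sizes, and the idea of pepper noise whose probability grows as a hyperbolic function of distance from the canvas centre come from the description above.

```python
import numpy as np

CANVAS = 224        # canvas size in pixels (from the text)
P_MAX = 0.9         # assumed maximum pepper probability in the far periphery
K = 40.0            # assumed half-saturation distance in pixels; illustrative only

def pepper_probability(dist):
    """Pepper-noise probability as a saturating hyperbolic function of the
    distance from the canvas centre: p(d) = P_MAX * d / (d + K)."""
    return P_MAX * dist / (dist + K)

def apply_peripheral_pepper(img, rng):
    """Zero out pixels with a probability that grows with eccentricity,
    mimicking reduced peripheral acuity. `img` is a 2-D array in [0, 1]."""
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    dist = np.hypot(ys - img.shape[0] / 2, xs - img.shape[1] / 2)
    out = img.copy()
    out[rng.random(img.shape) < pepper_probability(dist)] = 0.0
    return out

def place_letter(letter, x_offset, canvas=CANVAS):
    """Paste a small (MNIST-style, white-on-black) letter image at a given
    horizontal offset in pixels from the canvas centre."""
    img = np.zeros((canvas, canvas), dtype=np.float32)
    h, w = letter.shape
    top, left = canvas // 2 - h // 2, canvas // 2 - w // 2 + x_offset
    img[top:top + h, left:left + w] = letter
    return img

# Example: a 16-pixel placeholder glyph displaced 60 px to the right of fixation,
# then degraded with eccentricity-dependent pepper noise.
rng = np.random.default_rng(0)
glyph = np.ones((16, 16), dtype=np.float32)   # stand-in for a Korean character
stimulus = apply_peripheral_pepper(place_letter(glyph, 60), rng)
```

Any saturating function of eccentricity would serve the same purpose; the hyperbolic form simply keeps the centre nearly noise-free while degrading the periphery smoothly.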
One interesting finding from Han et al. that our CNN model does not explain is the behavioral asymmetry between 0 → D and D → 0. It is important to emphasize here that neither of these conditions assessed intrinsic translation invariance (performance was confounded with visual acuity), and the asymmetry may reflect properties of visual short-term memory rather than visual invariance. For example, it may be more difficult to maintain a highly degraded image of an unfamiliar Korean letter in short-term memory, and this may have selectively impaired performance in the same/different task when the first letter (the target) was presented in peripheral vision and had to be stored for 2 seconds. Previous behavioral studies that avoided this confound with acuity failed to observe this asymmetry (Bowers et al., 2016; Blything et al., in press).

In summary, Han et al. (2020) have misinterpreted their behavioral data: when intrinsic translation invariance is correctly assessed it is near complete for stimuli of all sizes and bounded by visual acuity. These results are consistent with previous work (Bowers et al., 2016; Blything et al., in press), and are broadly consistent with standard CNNs when they are given inputs that capture human visual acuity.
Figure 3: (a) An example of the randomized pepper noise used in our experiment, here applied to a white canvas. This is the same type of noise applied to all images in our dataset. Notice how the amount of noise increases farther from the center. (b-f) Examples of input images for the network, at different spatial locations and sizes. (b-c) are objects of size 10 and 16 pixels, slightly displaced from the center (b to the right, c to the left of the canvas center); due to their small sizes, the stimuli are barely visible even with limited translation. (d-f) are examples of larger objects, which remain more visible even when translated.
Figure 4: Results from the modelling experiment presented in this work. A VGG16 network is trained on peripherally degraded images at horizontal locations along the canvas. For each trained location, the network is then tested at the same location (D → D) or at the opposite location (D → Opp). Once the limitations due to image degradation are taken into account, the network is highly tolerant to translation across almost the whole canvas.
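To make the D → D versus D → Opp comparison summarized in Figure 4 concrete, the sketch below shows how the two accuracies might be computed for a single trained location. It reuses place_letter and apply_peripheral_pepper from the earlier sketch; the classifier, letter arrays, labels, and the 60-pixel trained offset are placeholders for illustration (a VGG16 fine-tuned on the Korean characters would take the classifier's place), not our actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def accuracy_at_offset(classifier, letters, labels, x_offset):
    """Classification accuracy when every letter is placed at `x_offset` pixels
    from fixation and degraded with eccentricity-dependent pepper noise."""
    correct = 0
    for letter, label in zip(letters, labels):
        stimulus = apply_peripheral_pepper(place_letter(letter, x_offset), rng)
        correct += int(classifier(stimulus) == label)
    return correct / len(letters)

def same_vs_opposite(classifier, letters, labels, trained_offset):
    """Accuracy at the trained location (D -> D) and at the mirrored location
    (D -> Opp); the difference isolates the cost of displacement itself,
    since the two locations are equally degraded."""
    return (accuracy_at_offset(classifier, letters, labels, trained_offset),
            accuracy_at_offset(classifier, letters, labels, -trained_offset))

# Toy usage with a placeholder classifier standing in for the trained VGG16.
letters = [rng.random((16, 16)).astype(np.float32) for _ in range(10)]
labels = list(range(10))
classify = lambda img: 0                      # placeholder prediction function
print(same_vs_opposite(classify, letters, labels, trained_offset=60))
```

Because the same degraded rendering is applied at both locations, any remaining gap between the two accuracies reflects translation rather than acuity, which is the comparison Figure 4 plots across locations and letter sizes.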
References

Afraz, S.-R., & Cavanagh, P. (2008). Retinotopy of the face aftereffect. Vision Research, 48(1), 42–54.

Biederman, I., & Cooper, E. E. (1991). Evidence for complete translational and reflectional invariance in visual object priming. Perception, 20(5), 585–593.

Biscione, V., & Bowers, J. (2020). Learning translation invariance in CNNs. arXiv preprint.

Blything, R., Biscione, V., Vankov, I. I., Ludwig, C. J., & Bowers, J. (in press). The human visual system and CNNs can both support robust online translation tolerance following extreme displacements. Journal of Vision.

Bowers, J. S., Vankov, I. I., & Ludwig, C. J. (2016). The visual system supports online translation invariance for object identification. Psychonomic Bulletin & Review, 23(2), 432–438.

Cox, D. D., & DiCarlo, J. J. (2008). Does learned shape selectivity in inferior temporal cortex automatically generalize across retinal position? Journal of Neuroscience, 28(40), 10045–10055.

Dill, M., & Edelman, S. (2001). Imperfect invariance to object translation in the discrimination of complex shapes. Perception, 30(6), 707–724.

Dill, M., & Fahle, M. (1997). The role of visual field position in pattern-discrimination learning. Proceedings of the Royal Society of London. Series B: Biological Sciences, 264(1384), 1031–1036.

Dill, M., & Fahle, M. (1998). Limited translation invariance of human visual pattern recognition. Perception & Psychophysics, 60(1), 65–81.

Han, Y., Roig, G., Geiger, G., & Poggio, T. (2020). Scale and translation-invariance for novel objects in human vision. Scientific Reports, 10(1), 1–13.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Strasburger, H., Rentschler, I., & Jüttner, M. (2011). Peripheral vision and pattern recognition: A review. Journal of Vision, 11(5), 13. https://doi.org/10.1167/11.5.13