A case for robust translation tolerance in humans and CNNs. A commentary on Han et al.
Ryan Blything, Valerio Biscione, and Jeffrey Bowers
University of Bristol, School of Psychological Science, Bristol, BS8 1TU, United Kingdom
* [email protected]

Han et al. (2020) reported a behavioral experiment that assessed the extent to which the human visual system can identify novel images at unseen retinal locations (what the authors call "intrinsic translation invariance") and developed a novel convolutional neural network model (an Eccentricity Dependent Network, or ENN) to capture key aspects of the behavioral results. Here we show that their analysis of behavioral data used inappropriate baseline conditions, leading them to underestimate intrinsic translation invariance. When the data are correctly interpreted they show near complete translation tolerance extending to 14° in some conditions, consistent with earlier work (Bowers et al., 2016) and more recent work (Blything et al., in press). We describe a simpler model that provides a better account of translation invariance.

Han et al. examined intrinsic translation invariance in humans by requiring non-Korean participants to classify Korean letters as same or different when a 'target' letter was flashed for 33 ms, followed two seconds later by a 'test' letter flashed for 33 ms. Target and test letters were shown at fixation or at varying eccentricities, resulting in three translation conditions they labelled Central (central target, peripheral test), Peripheral (peripheral target, central test), and Opposite (peripheral target, test at the opposite peripheral side). They also assessed the impact of image size (scale) on translation invariance, with letters subtending 30′, 1°, or 2°. In all cases, performance in these translation conditions was compared to a condition in which the target and test letters were both presented at fixation. As illustrated in Figure 1 (a reproduction of their Figure 3), they observed substantially reduced performance in the Peripheral and Opposite conditions at all scales, and reduced performance in the Central condition for the smaller stimuli. The authors took these findings to highlight substantial limitations of intrinsic translation invariance, especially for smaller stimuli. However, this conclusion is mistaken. The reduced performance across the translation conditions was largely the product of poor visual acuity in peripheral vision.

Figure 1: Taken from Han et al. (2020).
Windows of invariance for different conditions. Same/different accuracy is shown on a color scale. The central window (top) depicts the results when target letters were presented at fixation and corresponding test letters in the periphery at various scales and eccentricities. The peripheral window (bottom left) depicts the results when target letters were presented at various scales and eccentricities and corresponding test letters at fixation. The opposite window (bottom right) depicts the results when target and test letters were presented at various scales and eccentricities, with both stimuli at the same distance from the center but on opposite sides of the visual field (or both presented at fixation at 0 degrees eccentricity). In all plots, the tested conditions are marked with circles and the other data points are estimated using natural neighbor interpolation.

The standard procedure for assessing translation invariance while controlling for visual acuity is to compare performance at a fixed eccentricity when target/test stimuli are presented at the same retinal location against performance when they are presented in opposite visual fields (e.g., Afraz & Cavanagh, 2008; Biederman & Cooper, 1991; Cox & DiCarlo, 2008; Dill & Edelman, 2001; Dill & Fahle, 1997; Dill & Fahle, 1998). In this way the target/test items have equal acuity, with a translation of 0° in the former condition and a translation of twice the eccentricity in the latter condition. However, this is not what is plotted in Figure 1. Rather, performance at 0° reflects performance when target and test stimuli are both presented at fixation, where acuity is maximal, explaining the extremely high performance in the Central, Peripheral, and Opposite conditions at 0° for all image scales. In all other translation conditions, the target letter, the test letter, or both were presented in peripheral vision, leading to a confound of translation with acuity.

Han et al. were aware of a possible confound between translation tolerance and visual acuity, but rejected acuity as the explanation for their findings on the grounds that Korean participants (who were familiar with the letters) were highly accurate in most translation conditions. We disagree with this logic. Korean participants have learned to recognise degraded Korean letters in peripheral vision (e.g., when fixating on the central letter of a word, the outer letters are projected to peripheral vision), and their success does not rule out acuity as the cause of the difficulty for non-Korean participants.

As noted above, the critical data for assessing translation tolerance while controlling for visual acuity come from comparing performance at a fixed eccentricity when target/test stimuli are presented in the same or opposite visual fields. These data are reported in Han et al.'s Figure S1 (Supplementary Material), and are replotted here as Figure 2. In all cases performance was similar when the letters were flashed in the same (solid blue line) and opposite (dashed yellow line) fields at a given eccentricity, with performance for the smaller stimuli reduced as a function of eccentricity. This shows that intrinsic translation invariance was near complete, with performance on the task largely limited by visual acuity.
Figure 2: Han et al. behavioral data. Comparison of the same-location (D → D) condition in blue vs. the opposite-location (D → Opp) condition in yellow, replotted from their supplementary data.

The near complete translation invariance extending to 14 degrees for the largest Korean letters is consistent with Bowers et al. (2016), who observed robust translation tolerance extending 13 degrees for images of unfamiliar 2D shapes that subtended 5 degrees, and Blything et al. (in press), who reported robust translation tolerance extending 18 degrees for images of unfamiliar 3D shapes that subtended 5 degrees.

This reanalysis of the Han et al. (2020) dataset undermines their Eccentricity Dependent Network (ENN), which was motivated to explain limited intrinsic translation invariance. Instead, the robust translation tolerance is more consistent with the standard CNN models that Han et al. (2020) rejected, although those models still fail to capture the overall drop in acuity at peripheral locations. With this in mind, we trained a classic CNN (VGG16; Simonyan & Zisserman, 2014) on a dataset with injected pepper noise, probabilistically applied as a hyperbolic function of the distance to the center of the canvas. The noise was designed to mimic the reduced acuity of peripheral vision (Strasburger et al., 2011) (Fig. 3a). Similarly to Han et al., we pretrained the network on a (noisy) MNIST handwritten-digit dataset at different locations and scales (pretraining on translated objects is essential to obtain translation invariance on novel items, as shown in Biscione & Bowers, 2020). We then re-trained this network on the set of Korean characters used by Han et al., divided into two groups of 15 characters each (same characters and grouping as in Figure 1 of their work, but with colours inverted for consistency with the MNIST dataset). The network's objective was to classify the characters from either of these groups. We used four different letter sizes (10, 16, 22, and 28 pixels) on a 224x224-pixel canvas. Importantly, we trained the network by displaying the items at only one location, and tested the network at the same (D → D) or opposite (D → Opp) location (Fig. 3b-f). Notice that in the D → Opp condition, the network is queried at a location at which it has never seen any Korean characters. By comparing D → D to D → Opp, we can infer the performance drop due solely to object displacement (translation tolerance). The results in Figure 4 show that intrinsic translation invariance is near perfect and bounded by visual acuity in the periphery. The model qualitatively mimics the relationship between object size, translation, and accuracy found in humans.
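The noise and stimulus-generation procedure can be illustrated with a short sketch. This is not the code used in our experiment: the exact noise constants (the maximum pepper probability and the half-saturation distance) and the placeholder glyph below are assumptions made purely for illustration; only the 224x224 canvas, the letter sizes, and the idea of pepper noise whose probability grows as a hyperbolic function of distance from the canvas centre come from the description above.

```python
import numpy as np

CANVAS = 224        # canvas size in pixels (from the text)
P_MAX = 0.9         # assumed maximum pepper probability in the far periphery
K = 40.0            # assumed half-saturation distance in pixels; illustrative only

def pepper_probability(dist):
    """Pepper-noise probability as a saturating hyperbolic function of the
    distance from the canvas centre: p(d) = P_MAX * d / (d + K)."""
    return P_MAX * dist / (dist + K)

def apply_peripheral_pepper(img, rng):
    """Zero out pixels with a probability that grows with eccentricity,
    mimicking reduced peripheral acuity. `img` is a 2-D array in [0, 1]."""
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    dist = np.hypot(ys - img.shape[0] / 2, xs - img.shape[1] / 2)
    out = img.copy()
    out[rng.random(img.shape) < pepper_probability(dist)] = 0.0
    return out

def place_letter(letter, x_offset, canvas=CANVAS):
    """Paste a small (MNIST-style, white-on-black) letter image at a given
    horizontal offset in pixels from the canvas centre."""
    img = np.zeros((canvas, canvas), dtype=np.float32)
    h, w = letter.shape
    top, left = canvas // 2 - h // 2, canvas // 2 - w // 2 + x_offset
    img[top:top + h, left:left + w] = letter
    return img

# Example: a 16-pixel placeholder glyph displaced 60 px to the right of fixation,
# then degraded with eccentricity-dependent pepper noise.
rng = np.random.default_rng(0)
glyph = np.ones((16, 16), dtype=np.float32)   # stand-in for a Korean character
stimulus = apply_peripheral_pepper(place_letter(glyph, 60), rng)
```

Any saturating function of eccentricity would serve the same purpose; the hyperbolic form simply keeps the centre nearly noise-free while degrading the periphery smoothly.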
One interesting finding from Han et al. that our CNN model does not explain is the behavioral asymmetry between 0 → D and D → 0. It is important to emphasize here that neither of these conditions assessed intrinsic translation invariance (performance was confounded with visual acuity), and the asymmetry may reflect properties of visual short-term memory rather than visual invariance. For example, it may be more difficult to maintain a highly degraded image of an unfamiliar Korean letter in short-term memory, and this may have selectively impaired performance in the same/different task when the first letter (the target) was presented in peripheral vision and had to be stored for 2 seconds. Previous behavioral studies that avoided this confound with acuity failed to observe this asymmetry (Bowers et al., 2016; Blything et al., in press).

In summary, Han et al. (2020) have misinterpreted their behavioral data: when intrinsic translation invariance is correctly assessed it is near complete for stimuli of all sizes and bounded by visual acuity. These results are consistent with previous work (Bowers et al., 2016; Blything et al., in press), and are broadly consistent with standard CNNs when they are given inputs that capture human visual acuity.
Figure 3: (a) An example of the randomized pepper noise used in our experiment, here applied to a white canvas. This is the same type of noise applied to all images in our dataset. Notice how the amount of noise increases farther from the center. (b-f) Examples of input images for the network, at different spatial locations and sizes. (b-c) are objects of size 10 and 16 pixels, slightly displaced from the center (b to the right, c to the left of the canvas center); due to their small sizes, the stimuli are barely visible even with limited translation. (d-f) are examples of larger objects, which remain more visible even when translated.
Figure 4: Results from the modelling experiment presented in this work. A VGG16 network is trained on peripherally degraded images at horizontal locations along the canvas. For each trained location, the network is then tested at the same location (D → D) or at the opposite location (D → Opp). Once the limitations due to image degradation are taken into account, the network is highly tolerant to translation across almost the whole canvas.
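To make the D → D versus D → Opp comparison summarized in Figure 4 concrete, the sketch below shows how the two accuracies might be computed for a single trained location. It reuses place_letter and apply_peripheral_pepper from the earlier sketch; the classifier, letter arrays, labels, and the 60-pixel trained offset are placeholders for illustration (a VGG16 fine-tuned on the Korean characters would take the classifier's place), not our actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def accuracy_at_offset(classifier, letters, labels, x_offset):
    """Classification accuracy when every letter is placed at `x_offset` pixels
    from fixation and degraded with eccentricity-dependent pepper noise."""
    correct = 0
    for letter, label in zip(letters, labels):
        stimulus = apply_peripheral_pepper(place_letter(letter, x_offset), rng)
        correct += int(classifier(stimulus) == label)
    return correct / len(letters)

def same_vs_opposite(classifier, letters, labels, trained_offset):
    """Accuracy at the trained location (D -> D) and at the mirrored location
    (D -> Opp); the difference isolates the cost of displacement itself,
    since the two locations are equally degraded."""
    return (accuracy_at_offset(classifier, letters, labels, trained_offset),
            accuracy_at_offset(classifier, letters, labels, -trained_offset))

# Toy usage with a placeholder classifier standing in for the trained VGG16.
letters = [rng.random((16, 16)).astype(np.float32) for _ in range(10)]
labels = list(range(10))
classify = lambda img: 0                      # placeholder prediction function
print(same_vs_opposite(classify, letters, labels, trained_offset=60))
```

Because the same degraded rendering is applied at both locations, any remaining gap between the two accuracies reflects translation rather than acuity, which is the comparison Figure 4 plots across locations and letter sizes.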
References

Afraz, S.-R., & Cavanagh, P. (2008). Retinotopy of the face aftereffect. Vision Research, 48(1), 42–54.

Biederman, I., & Cooper, E. E. (1991). Evidence for complete translational and reflectional invariance in visual object priming. Perception, 20(5), 585–593.

Biscione, V., & Bowers, J. (2020). Learning translation invariance in CNNs. arXiv preprint.

Blything, R., Biscione, V., Vankov, I. I., Ludwig, C. J., & Bowers, J. (in press). The human visual system and CNNs can both support robust online translation tolerance following extreme displacements. Journal of Vision.

Bowers, J. S., Vankov, I. I., & Ludwig, C. J. (2016). The visual system supports online translation invariance for object identification. Psychonomic Bulletin & Review, 23(2), 432–438.

Cox, D. D., & DiCarlo, J. J. (2008). Does learned shape selectivity in inferior temporal cortex automatically generalize across retinal position? Journal of Neuroscience, 28(40), 10045–10055.

Dill, M., & Edelman, S. (2001). Imperfect invariance to object translation in the discrimination of complex shapes. Perception, 30(6), 707–724.

Dill, M., & Fahle, M. (1997). The role of visual field position in pattern-discrimination learning. Proceedings of the Royal Society of London. Series B: Biological Sciences, 264(1384), 1031–1036.

Dill, M., & Fahle, M. (1998). Limited translation invariance of human visual pattern recognition. Perception & Psychophysics, 60(1), 65–81.

Han, Y., Roig, G., Geiger, G., & Poggio, T. (2020). Scale and translation-invariance for novel objects in human vision. Scientific Reports, 10(1), 1–13.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Strasburger, H., Rentschler, I., & Jüttner, M. (2011). Peripheral vision and pattern recognition: A review. Journal of Vision, 11(5), 13. https://doi.org/10.1167/11.5.13