Conditional Positional Encodings for Vision Transformers
Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, Chunhua Shen
Do We Really Need Explicit Position Encodings for Vision Transformers?
Xiangxiang Chu, Bo Zhang, Zhi Tian, Xiaolin Wei, Huaxia Xia
Meituan Inc., The University of Adelaide
{chuxiangxiang, zhangbo97, weixiaolin02, xiahuaxia}@meituan.com, [email protected]

Abstract
Almost all visual transformers such as ViT [14] or DeiT [41] rely on predefined positional encodings to incorporate the order of each input token. These encodings are often implemented as learnable fixed-dimension vectors or sinusoidal functions of different frequencies, which cannot accommodate variable-length input sequences. This inevitably limits a wider application of transformers in vision, where many tasks require changing the input size on-the-fly. In this paper, we propose to employ an implicit conditional position encoding scheme, which is conditioned on the local neighborhood of the input token. It is effortlessly implemented as what we call a Position Encoding Generator (PEG), which can be seamlessly incorporated into the current transformer framework. Our new model with PEG is named Conditional Position encodings Visual Transformer (CPVT) and can naturally process input sequences of arbitrary length. We demonstrate that CPVT can produce visually similar attention maps and even better performance than models with predefined positional encodings. We obtain state-of-the-art results on the ImageNet classification task compared with visual Transformers to date. Our code will be made available at https://github.com/Meituan-AutoML/CPVT.
1. Introduction
Recently, Transformer [42] has been viewed as a strong alternative to convolutional neural networks (CNNs) in visual recognition tasks such as classification [14] and detection [5, 51]. Unlike the convolution operation in CNNs, which has a limited and fixed receptive field, the self-attention mechanism in transformers can capture long-distance information and dynamically adapt the receptive field according to the image content. As a result, transformers are considered more flexible and powerful than CNNs, and thus promising. However, the self-attention operation in transformers is permutation-invariant, which cannot leverage the order of
Figure 1: Visual Transformers: (a) with hardcoded positional encoding [14]; (b) without positional encoding; (c) with the proposed Position Encoding Generator (PEG) plugin. Attention in (c) automatically perceives position through learning.

the tokens in an input sequence. To mitigate this gap, previous works [42, 14] add an absolute positional encoding to each token in the input sequence (see Figure 1a), which enables order-awareness. The positional encoding can either be learnable or fixed with sinusoidal functions of different frequencies. Despite being effective and non-invasive, these explicit positional encodings seriously harm the flexibility of the transformers, hampering their broader applications. Taking the learnable version as an example, the encodings are often a vector of the same length as the input sequence, which is jointly updated with the model parameters during training. Thus, the length of the positional encodings is fixed once trained. This makes it difficult for the model to handle sequences of different lengths, particularly longer ones.

Failing to adapt to a variable input length during testing greatly limits the application, because many vision tasks (e.g., object detection) require changing the image size (i.e., the input length in the transformer) on-the-fly. A possible remedy is to use bicubic interpolation to upsample the positional encodings to the target length, but it may degrade the performance without fine-tuning, as later shown in our experiments. One may also use relative positional encodings as in [35]. However, relative positional encodings not only slow down training and testing, but also require modifying the implementation of the standard transformers. Last but not least, the relative positional encodings cannot work equally well as the absolute ones, as also later shown in our experiments. We conjecture that this is because the image recognition task still requires absolute position information [22].

In this work, we advocate a novel position encoding scheme to implicitly incorporate the position information into Transformers. Unlike the positional encodings used in previous works [14, 42, 35], which are predefined and input-agnostic, ours are generated on-the-fly and conditioned on the local neighborhood of an input token. We demonstrate that the vision Transformer with this new encoding (i.e., CPVT, see Figure 1c) can produce visually similar attention maps and result in even better performance than the previous vision transformers [14, 41].

We summarize our contributions as follows.

• We propose a novel position encoding scheme, termed conditional position encodings (CPE), which conditionally imbues Transformers with implicit positional information on-the-fly. By doing so, Transformers are unlocked to process input images of arbitrary size without bicubic interpolation or fine-tuning.

• We demonstrate that positional encoding is crucial to vision transformers and that it helps learn a schema of locality. We empirically show that CPE helps learn locality information as well.

• CPE is generated by a so-called Positional Encoding Generator (PEG), whose implementation is effortless and doesn't require tampering much with the current Transformer API compared with [35]. It is also widely supported in mainstream deep learning frameworks [29, 1, 8].

• Our refurbished vision transformer with CPE is called Conditional Position encodings Visual Transformer (CPVT), and it achieves new state-of-the-art performance on ImageNet compared with prior arts [14, 41].
2. Related Work
Transformer [42] renovates language models after RNNs (typically in the form of GRU [9] and LSTM [16]) via an encoder-decoder architecture based on the self-attention mechanism and feed-forward networks. Due to its parallel-friendly computation, Transformer and its variants like BERT [13] and the GPT models [30, 31] can be pre-trained on very large datasets, which allows outstanding knowledge transfer to many real-world language tasks. Remarkably, when pre-trained with an excessively large plain-text dataset, GPT-3 [4] even works on downstream tasks out-of-the-box without any fine-tuning.
Attention has been shown to be able to replace convolution in vision tasks [32]. It turns out that self-attention layers attend to pixel-grid patterns similarly to CNN layers [10]. Most recently, attention-based transformers have gone viral in vision.
Detection.
DETR [5] applies Transformer to object detection, effectively removing the need for non-maximum suppression and anchor generation. Deformable DETR [51] involves sampling-based attention to speed up DETR. These Transformer detectors [5, 51] use CNNs for feature extraction, so fixed-length position encodings are applied.
Classification.
After reducing the image resolution and color space, iGPT [7] trains Transformer on pixels so that it can later be fine-tuned for classification tasks. Closely related to our work, ViT [14] makes Transformers scalable for classification tasks. Specifically, it decomposes a 224 × 224 image into a sequence of 196 patches (16 × 16 pixels each), which are analogous to a sequence of words in language processing. DeiT [41] is noted for its data efficiency and can be directly trained on ImageNet; it largely eases the need for pre-training models on an excessively large dataset as in [14].
Segmentation.
Transformer reassures its advantage in instance segmentation [45], panoptic segmentation [43], and semantic segmentation [49]. SETR [49] uses ViT encoders for feature extraction and appends a multi-level feature aggregation module for segmentation.
Low-level vision.
Transformers have also proved powerful in many low-level vision tasks such as image generation [28]. IPT [6] profits from multi-head features to unify super-resolution, image denoising, and deraining within a single framework, developing an all-in-one model.
Attention is a mechanism that gives importance to the most relevant parts of an input signal. Convolution operates on both the spatial dimension and the channel dimension within local receptive fields. To capture information with global receptive fields, convolutions are typically stacked with downsampling and non-linear modules. SENet [18] explicitly models channel-wise interdependencies with a squeeze-and-excitation module. CBAM [46] sequentially processes the channel and spatial dimensions to generate an attention map. Self-attention is used to draw global dependencies between input and output [42]. Similarly, in non-local networks [44], a generalized form of self-attention called the non-local operation is developed to capture long-range dependencies in time.

2.4. Positional Encodings
Positional encodings are crucial to exploiting the order of sequences. Convolution is found to implicitly encode absolute positions [23]; in particular, zero padding and borders act as anchors from which spatial information can be derived. However, the self-attention mechanism in Transformer does not explicitly model relative or absolute position information. To this end, Transformer [42] adopts explicit absolute sinusoidal positional encodings added to the input embeddings. CoordConv [26] uses concatenation instead of addition. Relative position encoding [35] considers distances between sequence elements and proves to be beneficial. A 2D relative position encoding is proposed for image classification in [3], showing superiority to 2D sinusoidal embeddings. LambdaNetworks [2] proposes a lambda layer to model long-range content-based and position-based interactions, which bypasses the need for expensive quadratic attention maps. They find that positional interactions are necessary for performance while content-based interactions only give marginal improvement. These existing positional encodings are either inflexible to implement or unadaptable to variable input lengths, which motivates us to rethink positional encoding more thoroughly.
3. Vision Transformer with Conditional Position Encodings
Transformer [42] features a series of encoders and decoders of an identical structure. Every encoder and decoder has a multi-head self-attention layer (MHSA) and a feed-forward network layer (FFN), while each decoder has an extra attention layer to process the output of the encoder.
Self-attention.
An attention function on an input sequence $x = \{x_1, \dots, x_n\}$ is computed simultaneously on a set of queries $Q$ with keys $K$ and values $V$ as follows,

$$\mathrm{Att}(x) = \mathrm{softmax}\left(\frac{Q \cdot K^{\top}}{\sqrt{d_k}}\right) \cdot V, \quad Q = W_Q x,\ K = W_K x,\ V = W_V x \qquad (1)$$

where $W_Q$, $W_K$, and $W_V$ are weight matrices that generate $Q$, $K$, and $V$ via linear transformations on $x$, and $Q \cdot K^{\top}$ calculates the attention scores as the dot products between the queries and all the keys, scaled by the dimension $d_k$ of the keys $K$.
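For concreteness, the following is a minimal single-head PyTorch sketch of Eq. (1); the module name, the tensor shapes, and the omission of dropout and an output projection are our own simplifications, not the implementation used in the paper.

import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    # A bare-bones rendering of Eq. (1): Att(x) = softmax(Q K^T / sqrt(d_k)) V
    def __init__(self, dim, d_k):
        super().__init__()
        self.w_q = nn.Linear(dim, d_k, bias=False)
        self.w_k = nn.Linear(dim, d_k, bias=False)
        self.w_v = nn.Linear(dim, d_k, bias=False)

    def forward(self, x):                                        # x: (B, N, dim)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (B, N, N)
        return scores.softmax(dim=-1) @ v                        # (B, N, d_k)

x = torch.randn(2, 197, 192)                       # e.g., 196 patch tokens + 1 class token
print(SingleHeadSelfAttention(192, 64)(x).shape)   # torch.Size([2, 197, 64])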
Multi-head Self-Attention. First, $Q$, $K$, and $V$ are linearly projected $h$ times with different learned weights. Then the self-attention function is applied in parallel to generate $h$ outputs, the so-called heads. All heads are concatenated to give the final output, i.e.,

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}, \quad \mathrm{head}_i = \mathrm{Att}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \qquad (2)$$
Feed-forward network. The attention output is typically processed by a two-layer linear transformation with an activation in between,

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)\, W_2 + b_2 \qquad (3)$$
Layer Normalization. A residual connection and a layer normalization are added around each of the sub-layers (e.g., the attention layer and the feed-forward layer) in every encoder and decoder, i.e., $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
Positional Encoding. Transformer [42] adopts a sinusoidal function to encode positions, e.g.,

$$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{\mathrm{model}}}\right), \quad PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{\mathrm{model}}}\right) \qquad (4)$$

where $pos$ denotes the word position in the sequence, $d_{\mathrm{model}}$ is the total encoding dimension, and $i$ indexes the current dimension. However, such absolute positions can be more naturally encoded as relative positional encodings (RPE) [35], as follows,

$$\mathrm{Att}(x_i) = \sum_{j=1}^{n} \alpha_{ij}\,(W_V x_j + a_{ij}^{V}), \quad \alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}}, \quad e_{ij} = \frac{x_i W_Q\,(x_j W_K + a_{ij}^{K})^{\top}}{\sqrt{d_k}} \qquad (5)$$

where $a_{ij} \in \mathbb{R}^{d_k}$ denotes the edge distance between $x_i$ and $x_j$ when viewing the input elements as a directed and fully-connected graph. RPE incurs considerable space complexity in practical implementations, which calls for a more efficient version as in [19]. Moreover, it also requires substantial intrusion into the standard Transformer API.
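For reference, a short sketch that builds the sinusoidal table of Eq. (4); the function name and tensor shapes are our own choices.

import torch

def sinusoidal_encoding(n_pos, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)              # (n_pos, 1)
    div = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)  # (d_model/2,)
    pe = torch.zeros(n_pos, d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe                                          # fixed once built, length n_pos

print(sinusoidal_encoding(196, 192).shape)             # torch.Size([196, 192])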
Vision Transformers are constrained by fixed image sizes. Given an image of size $H \times W$, it is flattened into patches of size $S \times S$, so the number of patches is $N = \frac{HW}{S^2}$ (both $H$ and $W$ must be divisible by $S$). A pre-trained transformer is presumed to process images of the same size. Noticeably, all other components of a vision transformer (e.g., MHSA and FFN) scale well with the spatial dimension, except for the position encoding, whose length is directly tied to the input size.

Position encodings are crucial to Vision Transformers.
Table 1: Comparison of various positional encoding (PE) strategies tested on the ImageNet validation set. Removing PE greatly damages the performance.

The second one, relative position encodings [35], complicates the implementation and is less efficient, requiring substantial structural intrusions into standard Transformer functions (shown in Eq. 5). Moreover, our controlled experiment shows that it behaves worse, with 70.5% top-1 accuracy (Table 1), which indicates that absolute position information is required for better performance.

The last one, interpolating the position encodings, is expedient when testing pre-trained vision Transformers for classification on higher-resolution images. However, under training scenarios in object detection and segmentation, where the input spatial dimensions change frequently, interpolation becomes infeasible.

Taking the above discussions into account, we are driven to devise a new strategy that relaxes the limit on fixed input sizes while still imposing position encodings. To handle different input sizes, the new position encodings need to have variable lengths according to the input tokens. This leads to implicit encodings, conditionally generated to match the input size on-the-fly. Additionally, a successful design should meet the following requirements:

(1) retaining strong performance;
(2) avoiding permutation equivariance, so that permuted input orders produce different responses;
(3) being efficient and easy to implement under modern deep learning frameworks, i.e., non-invasive to the standard Transformer API.

Before we proceed, we briefly review how to define absolute positions for a sequence of length N. One straightforward approach is to assign a position to each element directly. Another is to define a reference point and to describe the relationships within a local neighborhood. When requested for the absolute position of a given element, we can reconstruct the relative relations from its neighbors; combining the relative information with the reference point, we can derive the actual position. For vision transformers, we can use a mechanism similar to the latter approach to implicitly define the positions of input patches.

To build the relationship of local neighbors, we utilize the 2D structure of images and reshape the flattened sequence $X \in \mathbb{R}^{B \times N \times C}$ back to the 2D space $X' \in \mathbb{R}^{B \times C \times H \times W}$. We then apply a 2D transformation $\mathcal{F}$ to impose local regularization on $X'$ and reshape the output back to the sequence space, which we designate as $X'' \in \mathbb{R}^{B \times N \times C}$. Since the class token $Y \in \mathbb{R}^{B \times C}$ doesn't involve position information, we keep it unchanged. The output is formed by concatenating the unchanged class token $Y$ and the regularized $X''$ along the token dimension. To construct the reference point, we utilize the boundary padding of 2D features. A toy example: when we perform convolutions on an image or its feature maps, zero padding indicates the position of the boundary points.
Positional Encoding Generator. There is a handy instantiation of $\mathcal{F}$ that meets requirements (2) and (3): a learnable 2D convolution with kernel size $k$ ($k \geq 3$) and $\frac{k-1}{2}$ zero paddings. It requires only a minimal change to the Transformer implementation. We call the overall module a Positional Encoding Generator (PEG), visualized in Figure 2. It is supposed to capture 2D position information that feeds into the attention pipeline. By design, it supports flexible scaling to various input spatial dimensions. We give an exemplary implementation in Section B.1 (supplementary) and a minimal sketch below. In practice, however, $\mathcal{F}$ can be more versatile, e.g., separable convolutions and many others.

Figure 2: Schematic illustration of PEG. Note that d is the embedding size and N is the number of tokens. The transformation unit can be a depth-wise or separable convolution or other more complicated blocks.
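Below is a minimal PyTorch sketch of this design (the class name and tensor shapes are our own; the residual addition mirrors the supplementary snippet in Section B.1). It is an illustration, not the exact implementation used for the reported results.

import torch
import torch.nn as nn

class PEG(nn.Module):
    # Positional Encoding Generator: a depth-wise convolution over the 2D token map
    def __init__(self, dim=192, k=3):
        super().__init__()
        # (k - 1) // 2 zero paddings keep the spatial size unchanged
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, 1 + H*W, C); the first token is the class token and is left untouched
        cls_token, feat_tokens = x[:, :1], x[:, 1:]
        B, N, C = feat_tokens.shape
        feat_map = feat_tokens.transpose(1, 2).reshape(B, C, H, W)  # sequence -> 2D map
        feat_map = feat_map + self.proj(feat_map)                   # add the generated encoding
        feat_tokens = feat_map.flatten(2).transpose(1, 2)           # 2D map -> sequence
        return torch.cat((cls_token, feat_tokens), dim=1)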
We then propose the Conditional Position encodings Visual Transformer (CPVT), which inserts PEGs between the encoders. Next, we empirically study its potential functionality; we leave requirement (1) to be verified in Section 4.2.

Knowing that position encodings are crucial to Transformers, we are curious to find out what they potentially do to contribute to better performance, and whether CPVT does the same. We begin by comparing the failure case (without position encoding) with DeiT [41]. A good point to watch is the attention behavior. It has been shown [14] that lower layers in Transformer have shorter attention distances on average; as the network depth increases, the attention distances become larger. We investigate whether the failure case without position encoding follows this observation.
Specifically, given a 224 × 224 image (i.e., 14 × 14 patches), we look at the normalized attention scores of the second encoder block, shown in Figure 3. With position encoding, DeiT exhibits a schema of locality, where a diagonal element interacts strongly with its local neighbors but only weakly with far-away elements. When position encoding is removed (DeiT w/o PE), the patches develop much weaker interactions with their neighbors. This leads us to hypothesize that the failure might be caused by the weak attention to neighboring elements.

Figure 3: Normalized scores from the second encoder block of DeiT vs. without position encoding (DeiT w/o PE) [41] on the same input. With position encoding, DeiT develops a schema of locality in the lower layers.
CPVT also learns locality information.
It turns out that CPVT, just like DeiT [41], learns locality information in the lower layers too, as shown by the normalized attention map in Figure 4. This suggests that CPVT learns a pattern similar to what DeiT achieves with absolute encodings. Next, we evaluate whether this behavior leads to better performance through experiments.
Figure 4: Comparison of the second-layer attention maps (reshaped to 14 × 14 grids) of CPVT vs. DeiT w/o PE.
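For completeness, a rough sketch of how such normalized maps can be produced, assuming access to a block's softmax attention weights (the function name, shapes, and the random stand-in input are hypothetical):

import torch

def normalized_patch_attention(attn):
    # attn: (B, heads, 197, 197) attention weights of one encoder block,
    # with the class token at index 0. Returns a (196, 196) patch-to-patch
    # map averaged over heads and batch, normalized to [0, 1].
    patch_attn = attn[:, :, 1:, 1:].mean(dim=(0, 1))
    patch_attn = patch_attn - patch_attn.min()
    return patch_attn / patch_attn.max()

attn = torch.rand(1, 3, 197, 197).softmax(dim=-1)   # stand-in for real attention weights
print(normalized_patch_attention(attn).shape)        # torch.Size([196, 196])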
4. Experiments
Dataset
We use the ILSVRC-2012 ImageNet dataset [12] with 1k classes and 1.3M images to train all our models, following DeiT [41]. We report results on the validation set throughout the paper. We don't use the much larger JFT-300M dataset [37], which is used in ViT [14] but undisclosed.
Model variants
We directly comply with the model setting variants of [41] and use three models with different throughputs to adapt to various computing scenarios. The detailed settings are shown in Table 2. This strategy enables controlled comparisons with the recent SOTAs [14, 41]. All experiments in this paper are performed on Tesla V100 machines. Training the tiny model for 300 epochs takes about 1.3 days on a single machine with 8 V100 GPU cards; CPVT-S and CPVT-B take about 1.6 and 2.5 days, respectively. The PEG plugin added to the CPVT models comes with negligible cost, see Section 4.4.
Table 2: CPVT architecture variants. The larger model, CPVT-B, has the same architecture as ViT-B [14] and DeiT-B [41]. CPVT-S and CPVT-Ti have the same architecture as DeiT-small and DeiT-tiny, respectively.
Training details
All models (except for CPVT-B) are trained for 300 epochs with a global batch size of 2048 on Tesla V100 machines using the AdamW optimizer [27]. We don't tune the hyper-parameters and strictly comply with the settings in DeiT [41]. The learning rate is scaled linearly with the global batch size, following DeiT: $lr_{\text{scale}} = 0.0005 \cdot \frac{\text{BatchSize}_{\text{global}}}{512}$. Although this may be sub-optimal for our method, our approach obtains competitive results compared with [41]. The detailed hyper-parameters are shown in Table 13 (supplementary).
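As a concrete example, assuming DeiT's base value of 0.0005, the global batch size of 2048 used here yields a scaled learning rate of 0.0005 × 2048 / 512 = 0.002.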
4.2. Comparison with State-of-the-art Methods

We evaluate the performance of the CPVT models on the ImageNet validation dataset and report the results in Table 3. Compared with DeiT, CPVT models have much better top-1 accuracy with similar throughputs. Since convolution is already efficiently implemented in popular deep learning frameworks, CPVT is quite efficient and has good throughput.
Table 3: Comparison with ConvNets and Transformers on ImageNet. CPVT models have much better performance compared with prior Transformers, and also benefit from direct scaling to a higher resolution without fine-tuning, while DeiT degrades. ⋆: throughput measured on one 16GB V100 GPU as in [41]. ∗: directly tested at 384 × 384 without fine-tuning. †: 4 PEGs inserted after the first encoder. ‡: one PEG inserted after each encoder from the first to the fifth.
Direct scaling to a higher resolution.
The most popular method for scaling a pre-trained vision Transformer to process higher-resolution images is simple interpolation of the position encodings [14, 41]. However, this interpolation potentially damages the performance. To verify this, we took all the models trained on 224 × 224 input images and tested them at a higher resolution of 384 × 384 without fine-tuning. The results are shown in the right-most column of Table 3. DeiT-tiny degrades from 72.2% to 71.2%. In contrast, since we removed explicit position encodings, the CPVT models can directly process images of arbitrary size. As a result, CPVT-Ti's performance is boosted from 73.4% to 74.2%, and the gap between DeiT-tiny and CPVT-Ti widens to 3.0%. The scaling result is overall promising, since CPVT doesn't require any extra endeavors to achieve good scalability; in other words, it relies solely on the neural network itself to generalize to different spatial inputs. This result resonates with the motivation of the conditional position encoding scheme.
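As an illustrative sanity check with the PEG sketch from Section 3 (the shapes and values here are hypothetical and this is not one of the reported experiments), the same module processes token maps of different sizes without any interpolation:

import torch

peg = PEG(dim=192, k=3)                  # the sketch class from Section 3
for H in (14, 24):                       # 224/16 = 14 and 384/16 = 24 patches per side
    x = torch.randn(1, 1 + H * H, 192)   # class token + H*H patch tokens
    print(peg(x, H, H).shape)            # torch.Size([1, 197, 192]) then torch.Size([1, 577, 192])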
We further evaluate the detection performance of our method on the COCO dataset [25]. Specifically, we only change the positional encoding strategy of the encoder part. We keep almost the same setting as [5], except that we shorten the training schedule from 500 to 50 epochs, since it takes about 2000 GPU hours to train DETR for 500 epochs [5, 51]. We hope this setting helps the community to form a resource-efficient baseline. Specifically, we train DETR using AdamW [27] with a total batch size of 32 and a weight decay of 0.0001. The backbone and the transformer use separate initial learning rates; the learning rate is scheduled with the stepLR strategy and decayed by 0.1× at epoch 40. The loss function has three components: the ℓ1 loss for bounding boxes (weight 5.0), the classification loss (1.0), and the GIoU loss (2.0) [34]. For the DETR models, we use 100 object queries and do not utilize the focal loss [24], to make a fair comparison.

Note that both DETR and Deformable DETR make use of 2D sine positional encodings. Table 4 shows that if this encoding is removed, the mAP of DETR degrades from 33.7% to 32.8%, which is consistent with the ablation study of DETR. However, its performance improves to 33.9% if PEG is plugged in. As for Deformable DETR, replacing the 2D sine positional encoding with PEG obtains better performance under various settings. In summary, PEG helps a learnable positional embedding outperform its human-designed counterpart on the object detection task.

Method | Epochs | AP | AP50 | AP75 | APS | APM | APL | Params | FPS
FCOS [40] | 36 | 41.0 | 59.8 | 44.1 | 26.2 | 44.6 | 52.2 | - | 23
Faster R-CNN [33] | 109 | 42.0 | 62.1 | 45.5 | 26.6 | 45.4 | 53.4 | 42M | 26
DETR [5] | 500 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 | 41M | 28
DETR-DC5 | 500 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 | 41M | 12
DETR w/o PE ⋆ | 50 | 32.8 | 54.0 | 33.5 | 13.7 | 35.5 | 50.5 | 41M | 28
DETR ⋆ | 50 | 33.7 | 54.5 | 34.7 | 13.2 | 35.8 | 51.5 | 41M | 28
DETR w/o PE + PEG | 50 | 33.9 | | | | | | 41M | 28
DD [51] | 50 | 39.4 | - | - | 20.6 | 43.0 | 55.0 | 34M | 27
DD w/o PE + PEG | 50 | | | | | | | 34M | 27
DETR-DC5 | 50 | 35.3 | 55.7 | 36.8 | 15.2 | 37.5 | 53.6 | 41M | 12
DD-DC5 [51] | 50 | 41.5 | 61.5 | 44.8 | 24.1 | 45.3 | 56.0 | 34M | 22
DD-DC5 w/o PE + PEG | 50 | | | | | | | 34M | 22

DD: Deformable DETR [51]. For DD, we always use a single scale. ⋆: reproduced results using the released code.

Table 4: Comparison on the COCO 2017 val set. PEG is a strong alternative to 2D sine positional encoding.

We give an example snippet of our implementation in Section B.2 (supplementary).
Negligible Parameters.
Given the model dimension $d$, the extra number of parameters introduced by PEG is $d \cdot l \cdot k^2$ if we choose $l$ depth-wise convolutions with kernel size $k$. Even if we use $l$ separable convolutions, this value becomes $l(d^2 + k^2 d)$. When $k = 3$ and $l = 1$, CPVT-Ti ($d = 192$) adds about $192 \times 3^2 = 1728$ parameters. Note that DeiT-tiny utilizes a learnable positional embedding with $192 \times 14 \times 14 = 37632$ parameters, so CPVT-Ti has 35904 fewer parameters than DeiT-tiny. Even using 4 layers of separable convolutions, the extra parameters remain marginal; since DeiT-tiny has 5.7M parameters, the added cost can be neglected.

Negligible FLOPs. As for FLOPs, $l$ layers of $k \times k$ depth-wise convolutions cost $14 \times 14 \times d \cdot l \cdot k^2$ on the $14 \times 14$ token map. Taking the tiny model as an example, this amounts to $14 \times 14 \times 192 \times 9 \approx 0.34$M FLOPs for the simple case $k = 3$, $l = 1$, which is negligible considering the tiny model has 2.1G FLOPs.
5. Ablation Study
Given the highly expensive computing cost (see Section 4.1), we use the tiny model for the ablation study (by default, with only one PEG implemented as a depth-wise 3 × 3 convolution), unless otherwise stated.

Previously, there were several types of commonly used encodings: absolute positional encodings (e.g., sinusoidal [42]), relative positional encodings (RPE) [35], and learnable encodings (LE) [13, 30]. These encodings are added to the input patch tokens before the first encoder block. We first compare the proposed PEG with these strategies in Table 5. Notice that we denote by i-j the inserted positions of PEG, which start from i and end at j-1.

Model | PEG Pos | Encoding | Top-1 (%) | Top-5 (%)
DeiT-tiny [41] | - | LE | 72.2 | 91.0
DeiT-tiny [41] | - | 2D sin-cos | 72.3 | 91.0
DeiT-tiny | - | 2D RPE | 70.5 | 90.0
CPVT-Ti | 0-0 | PEG | 72.4 | 91.2
CPVT-Ti | 0-0 | PEG + LE | 72.9 | 91.4
CPVT-Ti | 0-0 | 4 × PEG + LE | 72.9 | 91.4
CPVT-Ti | 0-5 | PEG | 73.4 |

Table 5: Comparison of various encoding strategies. LE: learnable encoding. RPE: relative positional encoding.

DeiT-tiny obtains 72.2% with the learnable absolute encoding. We also extend the sinusoidal encoding in Equation 4 to 2D space and achieve on-par performance. As for RPE, we follow [35] and set the local-range hyper-parameter K to 8. This requires changing the self-attention formulation in Equation 5, and we obtain 70.5% top-1 accuracy.

Moreover, we combine the learnable absolute encoding with a single PEG. This boosts the CPVT-Ti (0-0) baseline by 0.5%. We attribute this to the limited representational capacity of a single PEG: if we stack 4 PEG layers, we can achieve 72.9% and match that performance. Moreover, if we add a single layer of PEG after each of the first five blocks, we obtain 73.4% top-1 accuracy, which indicates that features at different transformer blocks may have different optimal positions.

We also experiment with varying the position of the PEG in the model. Table 6 presents the ablations for different positions based on the tiny model.
We denote the input of the first encoder by index -1; therefore, position 0 is the output of the first encoder block. Our method shows strong performance (~72%) when the PEG is placed at position 0, clearly better than placing it before the first encoder (position -1) or removing it.

Position Idx | Top-1 Acc (%) | Top-5 Acc (%)
none | 68.2 | 88.7
-1 | 70.6 | 90.2
0 | 72.4 | 91.2

Table 6: Performance of different plugin positions using the architecture of DeiT-tiny on ImageNet.
Why is placing the PEG at position -1 much worse than placing it at position 0 (Table 6)? We observe that the largest difference between the two is their receptive field: the latter enjoys a global field, since its input has already passed through one self-attention layer, while the former only attends to a local area. Hence, the two should work similarly well if we enlarge the convolution's range. To test this hypothesis, we use a rather large kernel size of 27 with a padding size of 13 at position -1; the result is reported in Table 7. It achieves 72.5% top-1 accuracy, which verifies our assumption and can thus be regarded as another variant of CPVT. We don't use it by default because it is a little slower and worse than placing the PEG at position 0.
PosIdx | Kernel | Params | Top-1 Acc (%) | Top-5 Acc (%)
-1 | 3 × 3 | 5.7M | 70.6 | 90.2
-1 | 27 × 27 | 5.8M | 72.5 | 91.3

Table 7: Performance of different kernel sizes (position -1).
We further evaluate whether using multi-position encodings benefits the performance in Table 8 (we follow the notation of Section 5.1). Inserting PEGs at five positions (0-5) improves the top-1 accuracy to 73.4%, at the cost of only 0.2M extra parameters.

Positions | Model | Params (M) | Top-1 Acc (%) | Top-5 Acc (%)
0-0 | tiny | 5.7 | 72.4 | 91.2
0-5 | tiny | 5.9 | 73.4 |

This setting cannot scale with the spatial dimensions of the input.

Table 8: CPVT's performance sensitivity to the number of plugin positions on the ImageNet validation dataset.
We further evaluate the performance of various regularization ranges, determined by the PEG kernel size. Specifically, we utilize a single depth-wise layer without batch normalization [21] or activation, and we fix the plugin position at 0. Table 9 shows that a 1 × 1 kernel obtains only 68.9% top-1 accuracy, close to the baseline without any encoding, which we attribute to the loss of positional information. Moreover, there are no extra gains with larger kernel sizes, e.g., 5 and 7.

Kernel | Params (M) | Top-1 Acc (%) | Top-5 Acc (%)
1 | 5.68 | 68.9 | 89.3
3 | 5.68 | 72.4 | 91.2

Table 9: Performance of different regularization ranges.
We design an experiment to quantify the importance of the absolute positional information provided by zero paddings. Specifically, we take CPVT-Ti and simply remove the zero paddings from the PEG while keeping all other parts unchanged. This shortens the token sequence length from 196 to 144. Table 10 shows that the tiny model only obtains 70.5% if the padding is removed, which indicates that absolute positional information plays an important role in classifying objects. This is because the category of each image is mainly labeled by the object in the center, so the Transformer has to know which patch is in the center.

Model | Padding | Top-1 Acc (%) | Top-5 Acc (%)
CPVT-Ti | ✓ | 72.4 | 91.2
CPVT-Ti | ✗ | 70.5 |

Table 10: ImageNet performance w.r.t. the padding strategy.
Representational power?
One might suspect that the performance improvement mainly comes from the extra representational power introduced by PEG. To disprove this, we use a deliberately weak PEG: a 3 × 3 depth-wise convolution with fixed, randomly initialized (i.e., non-learnable) weights. It still achieves 71.3% top-1 accuracy, 3.1% higher (↑) than the model without any encoding (68.2%). Representational power alone would not make such a big difference. As a comparison, when we add 12 more fully connected layers (i.e., kernel size 1) with skip connections after the first encoder, which brings about 0.5M more parameters, the performance is only boosted to 68.6% (0.4% ↑).

Kernel | Style | Params (M) | Top-1 Acc (%)
none | - | 5.68 | 68.2
3 | fixed (random init) | 5.68 | 71.3
3 | fixed (learned init) | 5.68 | 72.3
1 (12 ×) | learnable | 6.13 | 68.6
3 | learnable | 5.68 | 72.4

Table 11: Ablation to decide the impact factor.
Positional Encoding?
Then, why can a single non-learnable depth-wise convolution make such a big difference? We have seen that fixed random weights give slightly weaker performance (71.3%) than the learnable baseline (72.4%). This suggests that the positional constraints imposed by a fixed-weight PEG are attenuated: since the encoding is not adaptive, i.e., less accurate, the network may have more trouble adapting its weights to it. To verify this, we fix a learned PEG instead of a randomly initialized one and train the tiny version of the model from scratch. It achieves 72.3% top-1 accuracy on ImageNet; compared with the learnable baseline, the learned PEG already conveys most of the positional information. Based on these observations, we conclude that it is the position encoding that matters the most. In the next section, we design various encoding functions for a better analysis.
Table 12 shows the performance of different PEG choices, all inserted at position 0. A depth-wise convolution acts as the baseline with 72.4% accuracy. Directly replacing the depth-wise convolution with a separable one marginally boosts the performance by 0.1%. Stacking three more layers of PEG (adding 0.1M more parameters) brings a further improvement, which we attribute to better encoding from more powerful representational capacity.

Stack Number | Type | Params (M) | Top-1 Acc (%)
1 × | Depth-wise | 5.7 | 72.4
1 × | Separable | 5.7 | 72.5
4 × | Separable | 5.8 |

Table 12: Ablation on the performance sensitivity to representational capacity. PEGs are inserted at position 0.
6. Conclusion
In this paper, we have introduced CPVT, a novel method that equips vision Transformers with position perception capability. Through both theoretical and extensive experimental studies, we systematically analyze how important positional encoding is to vision Transformers. We discover that position information is crucial but does not necessarily have to be explicitly specified. A new out-of-the-box plugin is proposed to replace explicit hardcoded position encodings, which leads to strong performance with negligible extra cost. Given the freedom to change the input size on-the-fly and its plug-and-play nature, we look forward to a broader application of the proposed method in transformer-driven vision tasks such as segmentation and video processing, or even in natural language processing.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016. 2
[2] Irwan Bello. LambdaNetworks: Modeling long-range interactions without attention. In International Conference on Learning Representations, 2021. 3, 13
[3] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In
Proceedings of the IEEE/CVF International Con-ference on Computer Vision , pages 3286–3295, 2019. 3, 4[4] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub-biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan-tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.Language models are few-shot learners. arXiv preprintarXiv:2005.14165 , 2020. 2[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, NicolasUsunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In
European Confer-ence on Computer Vision , pages 213–229. Springer, 2020. 1,2, 6[6] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, YipingDeng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, andWen Gao. Pre-trained image processing transformer. arXivpreprint arXiv:2012.00364 , 2020. 2[7] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee-woo Jun, David Luan, and Ilya Sutskever. Generative pre- training from pixels. In
International Conference on Ma-chine Learning , pages 1691–1703. PMLR, 2020. 2[8] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang,Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, andZheng Zhang. Mxnet: A flexible and efficient machinelearning library for heterogeneous distributed systems. arXivpreprint arXiv:1512.01274 , 2015. 2[9] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, andYoshua Bengio. Empirical evaluation of gated recurrentneural networks on sequence modeling. arXiv preprintarXiv:1412.3555 , 2014. 2[10] Jean-Baptiste Cordonnier, Andreas Loukas, and MartinJaggi. On the relationship between self-attention and con-volutional layers. In
International Conference on LearningRepresentations , 2020. 2[11] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le.Randaugment: Practical automated data augmentation with areduced search space.
Advances in Neural Information Processing Systems, 33, 2020. 12
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. 5
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In
Proceedings of the2019 Conference of the North American Chapter of the As-sociation for Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers) , pages4171–4186, 2019. 2, 7[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image isworth 16x16 words: Transformers for image recognition atscale. In
International Conference on Learning Representa-tions , 2021. 1, 2, 4, 5, 6, 12[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition. In
Proceed-ings of the IEEE conference on computer vision and patternrecognition , pages 770–778, 2016. 6, 13[16] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-termmemory.
Neural computation , 9(8):1735–1780, 1997. 2[17] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, TorstenHoefler, and Daniel Soudry. Augment your batch: Improvinggeneralization through instance repetition. In
Proceedings ofthe IEEE/CVF Conference on Computer Vision and PatternRecognition , pages 8129–8138, 2020. 12[18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net-works. In
Proceedings of the IEEE conference on computervision and pattern recognition , pages 7132–7141, 2018. 2[19] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit,Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew MDai, Matthew D Hoffman, Monica Dinculescu, and DouglasEck. Music transformer. In
Advances in Neural ProcessingSystems , 2018. 3[20] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kil-ian Q Weinberger. Deep networks with stochastic depth. In uropean conference on computer vision , pages 646–661.Springer, 2016. 12[21] Sergey Ioffe and Christian Szegedy. Batch normalization:Accelerating deep network training by reducing internal co-variate shift. In International conference on machine learn-ing , pages 448–456. PMLR, 2015. 8[22] Md Amirul Islam, Sen Jia, and Neil DB Bruce. How muchposition information do convolutional neural networks en-code? In
International Conference on Learning Representa-tions , 2020. 2[23] Md Amirul Islam, Matthew Kowal, Sen Jia, Konstantinos GDerpanis, and Neil DB Bruce. Position, padding and predic-tions: A deeper look at position information in cnns. arXivpreprint arXiv:2101.12322 , 2021. 3[24] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, andPiotr Doll´ar. Focal loss for dense object detection. In
Pro-ceedings of the IEEE international conference on computervision , pages 2980–2988, 2017. 6[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C LawrenceZitnick. Microsoft coco: Common objects in context. In
European conference on computer vision , pages 740–755.Springer, 2014. 6[26] Rosanne Liu, Joel Lehman, Piero Molino, Felipe PetroskiSuch, Eric Frank, Alex Sergeev, and Jason Yosinski. Anintriguing failing of convolutional neural networks and thecoordconv solution. In
Advances in Neural Information Pro-cessing Systems , page 9628–9639, 2018. 3[27] Ilya Loshchilov and Frank Hutter. Decoupled weight de-cay regularization. In
International Conference on LearningRepresentations , 2019. 5, 6[28] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, LukaszKaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Im-age transformer. In Jennifer Dy and Andreas Krause, ed-itors,
Proceedings of the 35th International Conference onMachine Learning , volume 80 of
Proceedings of MachineLearning Research , pages 4055–4064, Stockholmsm¨assan,Stockholm Sweden, 10–15 Jul 2018. PMLR. 2[29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer,James Bradbury, Gregory Chanan, Trevor Killeen, ZemingLin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im-perative style, high-performance deep learning library.
Ad-vances in Neural Information Processing Systems , 32:8026–8037, 2019. 2[30] Alec Radford, Karthik Narasimhan, Tim Salimans, and IlyaSutskever. Improving language understanding by generativepre-training. 2018. 2, 7[31] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, DarioAmodei, and Ilya Sutskever. Language models are unsuper-vised multitask learners.
OpenAI blog , 1(8):9, 2019. 2[32] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, IrwanBello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. In H. Wallach, H. Larochelle,A. Beygelzimer, F. d'Alch´e-Buc, E. Fox, and R. Garnett, ed-itors,
Advances in Neural Information Processing Systems ,volume 32, pages 68–80. Curran Associates, Inc., 2019. 2 [33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.Faster r-cnn: towards real-time object detection with re-gion proposal networks. In
Proceedings of the 28th In-ternational Conference on Neural Information ProcessingSystems-Volume 1 , pages 91–99, 2015. 6[34] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, AmirSadeghian, Ian Reid, and Silvio Savarese. Generalized in-tersection over union: A metric and a loss for bounding boxregression. In
Proceedings of the IEEE/CVF Conference onComputer Vision and Pattern Recognition , pages 658–666,2019. 6[35] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In
Proceed-ings of the 2018 Conference of the North American Chap-ter of the Association for Computational Linguistics: HumanLanguage Technologies, Volume 2 , pages 464–468, 2018. 1,2, 3, 4, 7[36] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, IlyaSutskever, and Ruslan Salakhutdinov. Dropout: a simple wayto prevent neural networks from overfitting.
The journal ofmachine learning research , 15(1):1929–1958, 2014. 12[37] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhi-nav Gupta. Revisiting unreasonable effectiveness of data indeep learning era. In
Proceedings of the IEEE internationalconference on computer vision , pages 843–852, 2017. 5[38] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, JonShlens, and Zbigniew Wojna. Rethinking the inception archi-tecture for computer vision. In
Proceedings of the IEEE con-ference on computer vision and pattern recognition , pages2818–2826, 2016. 12[39] Mingxing Tan and Quoc Le. Efficientnet: Rethinking modelscaling for convolutional neural networks. In
InternationalConference on Machine Learning , pages 6105–6114. PMLR,2019. 6[40] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos:Fully convolutional one-stage object detection. In
Proceed-ings of the IEEE/CVF International Conference on Com-puter Vision , pages 9627–9636, 2019. 6[41] Hugo Touvron, Matthieu Cord, Matthijs Douze, FranciscoMassa, Alexandre Sablayrolles, and Herv´e J´egou. Trainingdata-efficient image transformers & distillation through at-tention. arXiv preprint arXiv:2012.12877 , 2020. 1, 2, 4, 5,6, 7, 12[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and IlliaPolosukhin. Attention is all you need. In
Proceedings of the31st International Conference on Neural Information Pro-cessing Systems , pages 6000–6010, 2017. 1, 2, 3, 4, 7[43] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille,and Liang-Chieh Chen. Max-deeplab: End-to-end panop-tic segmentation with mask transformers. arXiv preprintarXiv:2012.00759 , 2020. 2[44] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-ing He. Non-local neural networks. In
Proceedings of theIEEE conference on computer vision and pattern recogni-tion , pages 7794–7803, 2018. 2[45] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen,Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to- nd video instance segmentation with transformers. arXivpreprint arXiv:2011.14503 , 2020. 2[46] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In SoKweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision(ECCV) , pages 3–19, 2018. 2[47] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, SanghyukChun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regular-ization strategy to train strong classifiers with localizable fea-tures. In
Proceedings of the IEEE/CVF International Con-ference on Computer Vision , pages 6023–6032, 2019. 12[48] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, andDavid Lopez-Paz. mixup: Beyond empirical risk minimiza-tion. In
International Conference on Learning Representa-tions , 2018. 12[49] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu,Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, TaoXiang, Philip HS Torr, et al. Rethinking semantic segmen-tation from a sequence-to-sequence perspective with trans-formers. arXiv preprint arXiv:2012.15840 , 2020. 2[50] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, andYi Yang. Random erasing data augmentation. In
Proceedingsof the AAAI Conference on Artificial Intelligence , volume 34,pages 13001–13008, 2020. 12[51] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang,and Jifeng Dai. Deformable detr: Deformable transformersfor end-to-end object detection. In
International Conference on Learning Representations, 2021. 1, 2, 6

A. Experiment Details

A.1. Hyperparameter settings
Table 13 gives the hyper-parameter details of CPVT.
Methods | ViT-B [14] | DeiT-B [41] | CPVT
Epochs | 300 | 300 | 300
Batch size | 4096 | 1024 | 1024
Optimizer | AdamW | AdamW | AdamW
Learning rate decay | cosine | cosine | cosine
Weight decay | 0.3 | 0.05 | 0.05
Warmup epochs | 3.4 | 5 | 5
Label smoothing ε [38] | ✗ | ✗ | ✗
Gradient Clip. | ✓ | ✗ | ✗

Table 13: Hyper-parameters for ViT-B, DeiT-B and CPVT.
B. Example Code
B.1. PEG
In its simplest form, we use a single depth-wise convolution and show its usage in a Transformer with the following PyTorch snippet. Through experiments, we find that such a simple design (i.e., a depth-wise 3 × 3 convolution) readily achieves on-par or even better performance than the recent SOTAs.
B.2. PEG for Detection
Note that the masked padding should be carefully dealt with to avoid wrong gradients. The self-attention components in most standard libraries support masked tokens. PEG can efficiently handle the masked padding using a few basic tensor operations, as follows.
C. Under the Hood: Why is the Encoding Conditional?
We further study the underlying working mechanism of CPVT. Without loss of generality, we ignore the batch dimension and use a model dimension of 1. We denote the output sequence of the first encoder as $X = (x_1, x_2, \dots, x_N)$ with $N = H \times H$ (assuming a square token map of side $H$).
Algorithm 1: PyTorch snippet of PEG.

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    def __init__(self, layers=12, dim=192, nhead=3, img_size=224, patch_size=16):
        super().__init__()
        self.cls_tokens = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_block = PEG(dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead, dim * 4, batch_first=True)
            for _ in range(layers)])
        # PatchEmbed (defined elsewhere) splits the image into patches and
        # returns the embedded tokens together with the patch size
        self.patch_embed = PatchEmbed(img_size, patch_size, dim * 4)

    def forward_features(self, x):
        B, C, H, W = x.shape
        x, patch_size = self.patch_embed(x)
        _H, _W = H // patch_size, W // patch_size
        x = torch.cat((self.cls_tokens.expand(B, -1, -1), x), dim=1)
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i == 0:                        # insert PEG after the first encoder
                x = self.pos_block(x, _H, _W)
        return x[:, 0]

class PEG(nn.Module):
    def __init__(self, dim=256, k=3):
        super().__init__()
        # depth-wise convolution; zero padding keeps the spatial size
        self.proj = nn.Conv2d(dim, dim, k, 1, k // 2, groups=dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        cls_token, feat_tokens = x[:, :1], x[:, 1:]
        cnn_feat = feat_tokens.transpose(1, 2).view(B, C, H, W)
        x = self.proj(cnn_feat) + cnn_feat     # generated encoding added back to tokens
        x = x.flatten(2).transpose(1, 2)
        return torch.cat((cls_token, x), dim=1)
Algorithm 2: PyTorch snippet of PEG for detection.

from torch import nn

class PEGDetection(nn.Module):
    def __init__(self, in_chans):
        super(PEGDetection, self).__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, in_chans, 3, 1, 1, bias=False, groups=in_chans),
            nn.BatchNorm2d(in_chans),
            nn.ReLU())

    def forward(self, x, mask, H, W):
        """x: (N, B, C) token sequence; mask: (B, N), True at padded positions."""
        _, B, C = x.shape
        _tmp = x.transpose(0, 1)[mask]                   # stash the padded tokens
        cnn_feat = x.permute(1, 2, 0).view(B, C, H, W)   # back to a 2D feature map
        x = cnn_feat + self.proj(cnn_feat)               # conditional positional encoding
        x = x.flatten(2).transpose(1, 2)                 # (B, N, C)
        x[mask] = _tmp                                   # restore padded positions unchanged
        return x.transpose(0, 1)                         # back to (N, B, C)
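A small hypothetical usage example of the module above (feature-map size, batch size, and mask layout are made up for illustration):

import torch

peg = PEGDetection(in_chans=256)
H, W, B = 25, 38, 2                             # backbone feature map size and batch size
x = torch.randn(H * W, B, 256)                  # (N, B, C) token layout as in DETR
mask = torch.zeros(B, H * W, dtype=torch.bool)
mask[:, -100:] = True                           # pretend the last 100 tokens are padding
print(peg(x, mask, H, W).shape)                 # torch.Size([950, 2, 256])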
We can write the convolution weight $W$ as

$$W = \begin{pmatrix} w_{-k,-k} & \cdots & w_{-k,0} & \cdots & w_{-k,k} \\ \vdots & & \vdots & & \vdots \\ w_{0,-k} & \cdots & w_{0,0} & \cdots & w_{0,k} \\ \vdots & & \vdots & & \vdots \\ w_{k,-k} & \cdots & w_{k,0} & \cdots & w_{k,k} \end{pmatrix}$$

We define the output of this convolution as $Y = (y_1, y_2, \dots, y_N) = \mathcal{F}(X)$, and the 2D mappings $y_{\lfloor m/H \rfloor,\, m\%H} = y_m$ and $x_{\lfloor m/H \rfloor,\, m\%H} = x_m$. The transform of $X$ can then be formulated as

$$y_m = x_m + \sum_{i=-k}^{k} \sum_{j=-k}^{k} x_{\lfloor m/H \rfloor + i,\; m\%H + j}\; w_{i,j} \qquad (6)$$

For simplicity, we degrade the projection weight matrices to three scalars $w_q$, $w_k$, $w_v$ and ignore multi-heads. The self-attention output $z_m$ can then be written as

$$z_m = \sum_{n=1}^{N} \frac{e^{\,w_q w_k y_m y_n}}{\sum_{l=1}^{N} e^{\,w_q w_k y_m y_l}}\; w_v\, y_n \qquad (7)$$

Substituting Eq. 6 into Eq. 7, we can derive that

$$\begin{aligned} y_m y_n &= \Big(x_m + \sum_{i=-k}^{k} \sum_{j=-k}^{k} x_{\lfloor m/H \rfloor + i,\, m\%H + j}\, w_{i,j}\Big) \times \Big(x_n + \sum_{p=-k}^{k} \sum_{q=-k}^{k} x_{\lfloor n/H \rfloor + p,\, n\%H + q}\, w_{p,q}\Big) \\ &= x_m x_n + x_m \sum_{p=-k}^{k} \sum_{q=-k}^{k} x_{\lfloor n/H \rfloor + p,\, n\%H + q}\, w_{p,q} + x_n \sum_{i=-k}^{k} \sum_{j=-k}^{k} x_{\lfloor m/H \rfloor + i,\, m\%H + j}\, w_{i,j} \\ &\quad + \sum_{i=-k}^{k} \sum_{j=-k}^{k} \sum_{p=-k}^{k} \sum_{q=-k}^{k} x_{\lfloor m/H \rfloor + i,\, m\%H + j}\; x_{\lfloor n/H \rfloor + p,\, n\%H + q}\; w_{p,q}\, w_{i,j} \end{aligned} \qquad (8)$$

From the perspective of encoding, CPVT can be regarded as a conditional encoding approach if we consider the transformation function $\mathcal{F}$ as a function that generates encodings. Note that we add $k$ zero paddings to make sure $Y$ has the same length as $X$; this is reflected by variables at boundary positions such as $x_{-1,-1}$ in Eq. 8, and this difference may bring absolute positional information. If $y_m$ and $y_n$ are near each other within the kernel range, there is a high probability that the same goes for the corresponding elements of $X$. Their dot product (generalized to high dimensions) contributes a positive attention score, which is weighted by the learnable $w$. This mechanism resembles relative position encoding in that it processes relative information. Therefore, the conditional encoding can also be regarded as a mixed encoding mechanism.

D. More Analysis
D.1. Comparison to Lambda Networks
Our work is also related to LambdaNetworks [2], which use 2D relative positional encodings. We evaluate its lambda module with an embedding size of 128, denoting this encoding scheme as RPE2D-d128. Noticeably, this configuration has about 5.9M parameters (comparable to DeiT-tiny) but only obtains 68.7%. We attribute this failure to a limited ability to capture the correct positional information. After all, lambda layers are designed to work with many CNN backbone components, such as downsampling to form stages, replacing ordinary convolutions in ResNet [15]; in contrast, CPVT is purely transformer-based.
D.2. Does each encoder need positional information?
We have shown that positional information is critical to vision transformers. A natural question is whether the position information is necessary for all the blocks. To verify this, we retain positional information within the first encoder and stop its forward propagation to the rest of the encoders. Specifically, we only inject learnable positional encodings into the queries and keys of the first encoder of the DeiT-tiny model. If the blocks after the first one did not require such information, the performance should be on par with the 72.2% top-1 accuracy on ImageNet; however, this setting only obtains 71.4%.
E. Figures
E.1. DeiT
Figure 5 presents normalized attention maps from the lower attention layers of DeiT, where it learns locality information.
Figure 5: First- and second-layer attention maps of DeiT (same input as CPVT in Figure 4 of the main text).
E.2. CPVT
Figure 6 gives the normalized attention map of CPVT vs. DeiT w/o PE.

Figure 6: Normalized scores from the second encoder block (196 × 196) of CPVT vs. DeiT w/o PE.

Figure 7: (a) default CPVT, with the PEG inserted after the first encoder; (b) CPVT with the PEG at position -1, i.e., before the first encoder, using a 27 × 27 kernel.