Attentive WaveBlock: Complementarity-enhanced Mutual Networks for Unsupervised Domain Adaptation in Person Re-identification and Beyond
Wenhao Wang∗
School of Mathematical Sciences, Beihang University, Beijing, China
[email protected]

Fang Zhao∗†
Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
[email protected]

Shengcai Liao
Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
[email protected]

Ling Shao
Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
[email protected]
Abstract
Unsupervised domain adaptation (UDA) for person re-identification is challenging because of the huge gap between the source and target domains. A typical self-training method uses pseudo-labels generated by clustering algorithms to iteratively optimize the model on the target domain. However, a drawback of this approach is that noisy pseudo-labels generally cause trouble in learning. To address this problem, a mutual learning method based on dual networks has been developed to produce reliable soft labels. However, as the two neural networks gradually converge, their complementarity is weakened and they are likely to become biased towards the same kind of noise. In this paper, we propose a novel lightweight module, the Attentive WaveBlock (AWB), which can be integrated into the dual networks of mutual learning to enhance their complementarity and further suppress noise in the pseudo-labels. Specifically, we first introduce a parameter-free module, the WaveBlock, which creates a difference between the two networks by waving blocks of feature maps differently. Then, an attention mechanism is leveraged to enlarge the difference created and discover more complementary features. Furthermore, two kinds of combination strategies, i.e. pre-attention and post-attention, are explored. Experiments demonstrate that the proposed method achieves state-of-the-art performance, with significant improvements in mAP on the Duke-to-Market, Market-to-Duke, Duke-to-MSMT, and Market-to-MSMT UDA tasks.

Introduction

The target of person re-identification (re-ID) is to match images of a person across different camera views. Because of its extensive range of applications, person re-ID has attracted attention from both academia and industry. In recent years, with the development of deep learning, supervised re-ID methods, such as [26, 28, 23, 4, 20, 46, 42, 2], have made impressive progress. However, several drawbacks still exist.
First, these methods require intensive manual labeling, which is expensive and time-consuming. Second, due to the domain gap, there is a significant performance drop when a model trained on a source domain is tested on a target domain [7, 9]. Therefore, unsupervised domain adaptation (UDA) was introduced, which aims at learning a model on a labeled source domain and adapting it to an unlabeled target domain.

∗ Equal contribution. † Corresponding author.
Preprint. Under review.

Figure 1: The gradient-weighted class activation maps of (a) the original image, (b) MMT [11], (c) WaveBlock, and (d) AWB. The difference in Frobenius norm between the two maps increases from MMT to WaveBlock to AWB.

Image-level adaptation, such as [7, 31], uses a generative adversarial network (GAN) [13] to transfer the image styles of the source domain to a target domain. Feature-level methods, such as [45], investigate underlying feature invariance. However, the performance of these approaches is still unsatisfactory when compared to their fully-supervised counterparts. Recently, several clustering-based methods, such as [25, 40, 10, 15], have been proposed, which employ clustering algorithms to group unannotated target images and generate pseudo-labels for training. Although they achieve state-of-the-art performance in various UDA tasks, their abilities are hindered by noisy pseudo-labels caused by imperfect clustering algorithms and limited feature transferability.

To address the aforementioned problem, a dual-network framework, Mutual Mean-Teaching (MMT) [11], was proposed, which trains two networks simultaneously and utilizes a temporally averaged model to produce reliable soft labels as supervision signals.
Although this design reduces the amplification of training errors to some degree, as the two networks converge, as shown in Fig. 1, they unavoidably become more and more similar, which weakens their complementarity and may bias them towards the same kind of noise. This limits further improvement in performance.

To overcome the above limitations, we propose a novel module, namely the Attentive WaveBlock (AWB), under the dual-network framework. The critical idea behind AWB is to create a difference between the two neural networks to enhance their complementarity. In particular, we first introduce the WaveBlock to modulate the feature maps of the two networks with different block-wise waves. Then, we utilize an attention mechanism to force the networks to focus on discriminative features in these regions, which further enlarges the difference between them. Two kinds of combinations are designed, i.e. pre-attention (Pre-A) and post-attention (Post-A), to produce such different and discriminative features. For Pre-A, the attention modules first learn discriminative features, and then WaveBlocks wave regions differently. For Post-A, WaveBlocks first generate different waves, and then the attention modules learn discriminative features on the different waves. In Fig. 1, we visualize the feature attention maps of the three mutual learning methods using gradient-weighted class activation maps [24] and compute the difference in Frobenius norm between two maps A and B, which is ‖A − B‖_F = √(Σ_{i,j} |a_ij − b_ij|²). As shown in Fig. 1, from MMT [11] to WaveBlock, the difference increases to some degree. Further, from WaveBlock to AWB, the attention mechanism enlarges the difference created.

We summarize our contributions as follows:

• We introduce a parameter-free module, the WaveBlock, that can create a difference under the dual-network framework.
It enhances the complementarity of the two networks and reduces the possibility that they become biased towards the same kind of noise.

• We propose to utilize an attention mechanism to enlarge the difference between the networks on the basis of the WaveBlock, and design two kinds of combination strategies, i.e. pre-attention and post-attention.

• The AWB module significantly improves performance on UDA tasks for person re-ID, with a negligible computational increase. Compared with state-of-the-art methods, we obtain significant improvements in mAP on the Duke-to-Market, Market-to-Duke, Duke-to-MSMT, and Market-to-MSMT re-ID tasks.

Related Work

Mainstream algorithms for UDA tasks can be categorized into three classes. The first is image-level methods, which use a GAN to transfer source-domain images to the target-domain style [38]. For instance, PTGAN [31] transfers knowledge, while SPGAN [7] focuses on self-similarity and domain-dissimilarity. Unfortunately, the performance of these methods lags far behind their fully-supervised counterparts. The second category is feature-level methods. For example, [45] investigates three types of underlying invariance, i.e. exemplar-invariance, camera-invariance and neighborhood-invariance. The last category is clustering-based adaptation. These methods [9, 19, 40, 10] follow a similar general pipeline: they first pre-train on the source domain and then transfer the learned parameters to fit the target domain. Due to imperfect clustering algorithms and large domain variance, the generated pseudo-labels tend to contain noise, which hinders further improvement in performance. Although MMT [11] was introduced to alleviate this problem by using a pair of neural networks to generate soft pseudo-labels, as training goes on, the two neural networks tend to converge and unavoidably share a high similarity. Therefore, it is necessary to consider how to create different networks and enhance their complementarity.
This is the starting point of our AWB.
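The Frobenius-norm gap used in Fig. 1, and again later when quantifying the created differences, is straightforward to compute. A minimal NumPy sketch (the function name is our own):

```python
import numpy as np

def map_difference(A, B):
    """Frobenius norm of the gap between two activation maps:
    ||A - B||_F = sqrt(sum_{i,j} |a_ij - b_ij|^2)."""
    return float(np.sqrt(np.sum(np.abs(A - B) ** 2)))
```

A larger value indicates that the two networks attend to more different regions of the same input.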
Attention has been widely used to enhance representation learning in the fields of image classification [27, 21, 36], object detection [3, 39, 8], and so on. For instance, the convolutional block attention module (CBAM) [32] uses channel attention and spatial attention to explore "what" and "where" to focus. The Non-local block [30] exploits global features. Furthermore, state-of-the-art fully-supervised person re-ID algorithms, such as ConsAtt [47], SCAL [2], SONA [35], and ABD-Net [4], on several datasets (Market-1501 [41], DukeMTMC [43], CUHK03 [16], MSMT17 [31]) adopt an attention scheme.
DropBlock was proposed in [12] as a regularization method that drops units in a contiguous region of a feature map. The Batch DropBlock Network (BDB) [5] uses a global branch and a feature-dropping branch to keep the global salient representations and reinforce attentive feature learning of local regions. Wu [34] uses multiple dropping branches on the basis of BDB to further boost performance.
Method

In this section, we first briefly review the Mutual Mean-Teaching (MMT) framework and then introduce our WaveBlock module. Finally, we present two different strategies for combining the attention mechanism with the WaveBlock.
Briefly, the MMT framework includes two identical networks with different initializations. Its pipeline is as follows: first, the two networks are pre-trained on the source domain to obtain initialized parameters. Then, in each epoch, offline hard pseudo-labels are generated using a clustering algorithm. In each iteration of a given epoch, refined soft pseudo-labels are produced by the two networks. The hard pseudo-labels and refined soft pseudo-labels generated by one network are then used together to supervise the learning process of the other network. Finally, again in each iteration, the temporally averaged models are updated and used for prediction. For more details, please refer to [11].
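The temporally averaged models above are plain exponential moving averages of the network parameters. A minimal sketch of one update step (the parameter-dict layout and the momentum value 0.999 are our own illustrative choices, not taken from [11]):

```python
def update_mean_teacher(teacher, student, alpha=0.999):
    """Temporally averaged model update:
    theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student.
    Parameters are stored as a name -> value dict; alpha is illustrative."""
    for name, value in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * value
    return teacher
```

In a real implementation this runs over the tensors of the two networks after every optimizer step.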
Figure 2: Overview of the WaveBlock module, which creates a difference between two networks by waving blocks of feature maps differently. Specifically, a block is randomly selected and kept the same, while the feature values of other blocks are doubled to form a wave.
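The waving operation in Fig. 2 can be sketched as follows (a NumPy sketch with our own function and argument names; the waving rate r = 0.25 is an illustrative value, and a PyTorch version would operate on batched tensors instead):

```python
import numpy as np

def waveblock(x, r=0.25, rng=None):
    """Wave a feature map x of shape (C, H, W): a randomly chosen block of
    [H * r] rows is kept unchanged, and all other values are doubled."""
    rng = rng or np.random.default_rng()
    _, h, _ = x.shape
    block_h = int(round(h * r))
    # random block start: X ~ U(0, [H * (1 - r)])
    start = int(rng.integers(0, int(round(h * (1 - r))) + 1))
    out = 2.0 * x                                        # doubled outside the block
    out[:, start:start + block_h, :] = x[:, start:start + block_h, :]
    return out
```

Because each network (and each GPU) draws its own random start position, the two networks almost never see identically waved feature maps.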
In order to enhance the complementarity of the two networks, we first introduce the WaveBlock module to create a difference between the networks, as illustrated in Fig. 2. Instead of dropping blocks as in [12], which may lose discriminative information, we modulate a given feature map with different block-wise waves, so that differences are created between the dual networks while the original information is preserved to some extent.

Given a feature map F ∈ R^{C×H×W}, where C is the number of channels and H and W are the spatial height and width, respectively, and a waving rate r, we first generate a random integer with uniform distribution:

    X ∼ U(0, [H · (1 − r)]),    (1)

where [·] is the rounding function. Then, we obtain the WaveBlock-modulated feature map F* ∈ R^{C×H×W}:

    F*_ijk = { F_ijk,      X ≤ j < X + [H · r],
             { 2 · F_ijk,  otherwise.    (2)

This design modulates a given feature map with block-wise waves while the original information is kept to some degree. When applying WaveBlocks to the feature maps F_1, F_2 of the two networks, respectively, a difference between the networks can be created by waving blocks of the feature maps differently. Let F*_1, F*_2 denote the output feature maps of the WaveBlocks and X_1, X_2 the waving random integers generated on the two networks; we calculate the probability that the same wave is generated for both. For simplicity, it is assumed that F_1 and F_2 have the same size. In order for F*_1 = F*_2, we must have X_1 = X_2. Since

    P(X_1 = X_2) = [H · (1 − r)] / [H · (1 − r)]² = 1 / [H · (1 − r)],    (3)

we have

    P(F*_1 = F*_2) = P(X_1 = X_2) = 1 / [H · (1 − r)].    (4)

If multiple GPUs are used for training, X is generated independently on each GPU. In practice, we set r experimentally and use four GPUs. Then, on feature maps with H = 16, we have

    P(F*_1 = F*_2) = 1 / [H · (1 − r)]⁴.    (5)

Because this probability is far too small for the waves of the two networks to coincide, we may say that a difference is always created between them.

In this section, the attention mechanism is integrated with the WaveBlock module to learn discriminative and different features. Two kinds of combination strategies are designed: pre-attention (Pre-A) and post-attention (Post-A).
(a) Pre-attention. (b) Post-attention.
Figure 3: Two different combination strategies for the attention module and the WaveBlock. The WaveBlock creates a difference between the two networks, while the attention mechanism focuses on learning different and discriminative features.
To show that the proposed WaveBlock can be combined with general attention methods, two kinds of attention mechanisms are tried here. The first is the convolutional block attention module (CBAM) [32]. Given a feature map F ∈ R^{C×H×W}, CBAM applies a channel attention map M_c and a spatial attention map M_s to F sequentially:

    K_1 = M_c(conv(F)) ⊗ conv(F),    (6)
    K_2 = M_s(K_1) ⊗ K_1,    (7)

where conv denotes several convolution blocks and ⊗ denotes element-wise multiplication. In CBAM, the channel attention exploits the inter-channel relationships of features, while the spatial attention focuses on "where" an informative part is located.

The second attention mechanism is the Non-local block [30]. Here we adopt its simplified version. Let F ∈ R^{C×H×W} denote a feature map for the Non-local block and θ denote a 1×1 convolution. Through θ, the number of channels of F is reduced from C to C/2, i.e. θ(F) ∈ R^{C/2×H×W}. Similarly, another 1×1 convolution φ also reduces the number of channels from C to C/2, i.e. φ(F) ∈ R^{C/2×H×W}. Then we collapse the spatial dimensions of θ(F) and φ(F) into a single dimension, i.e. θ′(F) ∈ R^{C/2×HW}, φ′(F) ∈ R^{C/2×HW}. We obtain the matrix J ∈ R^{HW×HW}:

    J = (θ′(F))^T · φ′(F).    (8)

Next, we adopt H × W as the scaling factor for J, without using softmax. In the other branch, F is fed into a function g, which is a 1×1 convolution followed by a batch normalization layer. Similarly, we collapse the spatial dimension of g(F) into a single dimension and further apply a transpose to get g′(F) ∈ R^{HW×C/2}. Finally, we multiply J with g′(F), transpose and reshape its dimensions to C/2×H×W, and use another 1×1 convolution h to restore the channel dimension to C.

As illustrated in Fig. 3(a), to combine the attention module with the WaveBlock, we first try arranging it before the WaveBlock, which we call the pre-attention (Pre-A) strategy.
In this way, the attention modules first learn discriminative features, and then the WaveBlocks wave regions differently to produce different and discriminative features. Given a feature map F ∈ R^{C×H×W}, we apply the WaveBlock after either of the two attention modules mentioned above and obtain F* = WaveBlock(Attention(F)). Here, the attention modules are used to enlarge the difference between the backward gradients generated by the WaveBlocks. Although the WaveBlock is able to make the two networks work on different regions of the feature maps, some features learned from non-discriminative regions, such as backgrounds, may still be similar. By combining the attention modules with the WaveBlock, the two networks focus on different and discriminative regions, such as the human body, and thus can learn more different features. The advantage of Pre-A is that the attention weights can be computed using the complete feature maps. This is more beneficial to CBAM because the convolution used to compute its spatial attention would otherwise be affected near the borders of waved regions.
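The simplified Non-local computation described above can be sketched in NumPy as follows. This is a sketch under our own naming: the weight matrices stand in for the 1×1 convolutions θ, φ, g, h, and the batch normalization inside g is omitted for brevity.

```python
import numpy as np

def simplified_nonlocal(F, Wt, Wp, Wg, Wh):
    """Simplified Non-local block. Wt, Wp, Wg are (C/2, C) matrices standing
    in for the 1x1 convolutions theta, phi, g; Wh is (C, C/2) for h."""
    C, H, W = F.shape
    X = F.reshape(C, H * W)                 # collapse spatial dimensions
    theta, phi, g = Wt @ X, Wp @ X, Wg @ X  # each of shape (C/2, HW)
    J = (theta.T @ phi) / (H * W)           # (HW, HW), scaled instead of softmax
    Y = (J @ g.T).T                         # multiply J with g'(F), then transpose
    return (Wh @ Y).reshape(C, H, W)        # restore the channel dimension to C
```

In the Post-A design this block is applied to the already-waved feature maps, so the attention weights in J are computed directly from the differently waved regions.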
The second combination strategy is shown in Fig. 3(b). We arrange the attention mechanism after the WaveBlock, which we call post-attention (Post-A). Correspondingly, the WaveBlocks first wave regions differently, and then the attention modules learn discriminative features on the waved regions to produce different and discriminative features. Given a feature map F ∈ R^{C×H×W}, after passing through the WaveBlock, either of the two attention modules mentioned above can be applied. This produces F* = Attention(WaveBlock(F)). Compared with Pre-A, although the waved regions may affect the computation of the attention weights, directly applying the attention modules to the differently waved regions is more efficient for enlarging feature differences. Post-A is more beneficial to the Non-local block because the non-local operation reduces the impact of waved regions.

Experiments

Market-1501 [41] was collected using six different cameras. The dataset has 1,501 labeled persons in 32,668 images. For training, there are 12,936 images of 751 identities. For testing, the query has 3,368 images and the gallery has 19,732 images. DukeMTMC-reID [43] contains 1,812 persons captured by eight cameras. Among them, 16,522 images of 702 identities are used for training. For testing, there are 2,228 queries and 17,661 gallery images. MSMT17 [31] is the largest and most challenging re-ID dataset. It consists of 126,441 bounding boxes of 4,101 identities taken by 15 cameras. There are 32,621 images for training, while the query has 11,659 images and the gallery has 82,161 images. To evaluate our algorithm, we adopt the mean average precision (mAP) and CMC at rank-1, rank-5, and rank-10. No post-processing is used and we follow the single-query evaluation protocol.

We essentially follow the same training settings as MMT [11]. For the source-domain pre-training, to ensure that the improvement comes from the different mutual training rather than an enhanced pre-trained network, no change is made, i.e.
ResNet-50 [14] is used as the backbone network. For the first stage of target-domain training, the attention modules are trained without the WaveBlock engaged. Specifically, for the Non-local block, two attention modules are plugged in after two stages of the ResNet-50 [14] backbone with random initialization, and are trained for several epochs with the other parameters frozen. For CBAM, we follow the attention mechanism arrangement in [32]; the modules are initialized with ImageNet [6] pre-trained weights and, similarly, trained for several epochs with the other parameters frozen. For the second stage of target-domain training, the WaveBlock is added to the two networks. Specifically, the attention module after Stage 3 of ResNet-50 [14] is integrated with the WaveBlock to form the AWB. For CBAM, the Pre-A design is used, and for the Non-local block, the Post-A design is utilized. Because we successfully enhance the complementarity and make it more difficult for the two neural networks to become biased towards the same kind of noise, the training process can last for more epochs, and we train with all parameters engaged. When clustering, we select the optimal k value for k-means following [11], choosing a separate k for each of the Duke-to-Market, Market-to-Duke, Duke-to-MSMT, and Market-to-MSMT tasks. For testing, the WaveBlock is not used.

To prove the superiority of the AWB under the MMT [11] framework, we compare our model with state-of-the-art methods on four domain adaptation tasks. The comparison results are shown in Table 1. We obtain consistent improvements in both mAP and rank-1 accuracy on Duke-to-Market, Market-to-Duke, Duke-to-MSMT, and Market-to-MSMT. Moreover, the AWB improves performance stably across different k values: on both Duke-to-Market and Market-to-Duke, Post-A (Non-local) improves mAP for each of the two k values tested.

Table 1: Comparison between our method and state-of-the-art algorithms. The results are reported on Market-1501 [41], DukeMTMC [43] and MSMT17 [31].
Methods | Duke-to-Market: mAP, rank-1, rank-5, rank-10 | Market-to-Duke: mAP, rank-1, rank-5, rank-10
SPGAN [7] | TJ-AIDL [29] | CFSM [1] | UCDA [22] | HHL [44] | BUC [19] | ARN [18] | CDS [33] | ENC [45] | PDA-Net [17] | UDAP [25] | PCB-PAST [40] | SSG [10] | ACT [37] | MMT [11] | AWB (Pre-A with CBAM) | AWB (Post-A with Non-local)
Methods | Duke-to-MSMT: mAP, rank-1, rank-5, rank-10 | Market-to-MSMT: mAP, rank-1, rank-5, rank-10
ENC [45] | SSG [10] | MMT [11] | AWB (Pre-A with CBAM) | AWB (Post-A with Non-local)
Table 2: The effectiveness of the WaveBlock for creating a difference. "Stage" denotes the position of the WaveBlock. "-s" indicates that the same shape of WaveBlock is adopted for both networks.
Methods: MMT [11] | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 1-s | Stage 2-s | Stage 3-s | Stage 4-s; rows: Duke-to-Market (mAP, rank-1) and Market-to-Duke (mAP, rank-1).

To prove the efficacy of each component in the AWB, we conduct ablation experiments on the DukeMTMC-to-Market-1501 and Market-1501-to-DukeMTMC tasks. The experimental results and analyses are reported below.
Effectiveness of WaveBlock for creating a difference.
One WaveBlock is arranged after different stages of ResNet-50 [14], without the attention mechanism. As shown in Table 2, arranging the WaveBlock at different positions brings various improvements, with Stage 3 being the best position. However, if the same shape of WaveBlock is adopted for both networks, performance becomes poorer. Therefore, even without an attention mechanism, the created difference still enhances the complementarity of the two neural networks to some degree. In conclusion, it is necessary for the WaveBlocks to use different shapes to create a difference between the two networks.
Effectiveness of the WaveBlock Design.
To illustrate the effectiveness of the WaveBlock design, the WaveBlock is replaced with the feature-dropping block of [5]. To avoid interference, no attention mechanism is used again. For both replaced positions, the resulting mAPs on the Duke-to-Market and Market-to-Duke tasks are lower than those obtained with the WaveBlock, as reported in Table 2. The reason is that DropBlock drops some discriminative and important features, which prevents the two neural networks from fitting the training data well. In contrast, the proposed WaveBlock modulates a given feature map while preserving the original features to some degree.

Table 3: Comparison between different numbers of WaveBlocks. "Stage" indicates the stages at which the WaveBlock is integrated.
Method | Duke-to-Market | Market-to-Duke, with WaveBlocks at various stage combinations; metrics: mAP and rank-1.

Table 4: The effectiveness of the AWB with CBAM. "CBAM" indicates that only CBAM is used. "Pre-A" denotes the performance of Pre-A with CBAM, while "Post-A" denotes the performance of Post-A with CBAM.
Method | Duke-to-Market: MMT [11], CBAM, Pre-A, Post-A | Market-to-Duke: MMT [11], CBAM, Pre-A, Post-A; metrics: mAP and rank-1.

How many WaveBlocks are needed in our proposed method?
We employ different numbers of WaveBlocks; the experimental results are shown in Table 3. Compared to Table 2, the conclusion is that using more WaveBlocks does not bring significant improvement, and one WaveBlock is enough to create a difference between the two neural networks.
Effectiveness of combining the attention mechanism with WaveBlock.
In this part, we demonstrate the effectiveness of the attention mechanism in the AWB and compare the two combination designs for the two kinds of attention mechanisms. The WaveBlock is arranged after Stage 3. The experimental results are displayed in Table 4 and Table 5, respectively. As can be observed, for CBAM, the Pre-A combination design is better than CBAM alone, while the Post-A combination design is worse. This is because the borders of the waved feature maps may affect the convolution computation for spatial attention, and the Pre-A design avoids this problem. For the Non-local block, both combination strategies perform better than adding the Non-local block directly. Specifically, the Post-A design is much better, because directly applying the attention modules to waved feature maps is more efficient for producing different and discriminative features, and the non-local operation reduces the impact of waved regions. Quantification of the created difference.
The differences created by the WaveBlock and enlarged by the attention mechanism are quantified in this part. We adopt the Post-A with Non-local design, and the WaveBlock is arranged after Stage 3. The difference is quantified by calculating the Frobenius norm between two gradient-weighted class activation maps [24] of the same input, taken after Stage 3 or after the proposed modules, as illustrated in the introduction. The differences in Frobenius norm over all images are then averaged to obtain the final quantified differences. As shown in Table 6, the quantified difference of the WaveBlock is larger than MMT's, and it is further enlarged by integrating the attention mechanism with the WaveBlock.

Table 5: The effectiveness of the AWB with the Non-local block. "Non-local" denotes that only the Non-local block is used. "Pre-A" denotes the performance of our Pre-A with the Non-local block, while "Post-A" denotes the performance of our Post-A with the Non-local block.
Method | Duke-to-Market: MMT [11], Non-local, Pre-A, Post-A | Market-to-Duke: MMT [11], Non-local, Pre-A, Post-A; metrics: mAP and rank-1.

Table 6: Quantified differences. Method | Duke-to-Market: MMT [11], Attention, WaveBlock, AWB | Market-to-Duke: MMT [11], Attention, WaveBlock, AWB; metric: averaged Frobenius-norm difference.

Conclusion

In this paper, we first propose a parameter-free module, the WaveBlock. Then, we design two kinds of combination methods, i.e. pre-attention and post-attention, to integrate our WaveBlock with the attention mechanism. We use the WaveBlock to create a difference between the two networks under the MMT framework, and an attention mechanism to enlarge this difference and learn different and discriminative features on the basis of the WaveBlock. By plugging our AWB into MMT, the complementarity of the two networks is enhanced and the possibility of their becoming biased towards the same kind of noise is decreased. Extensive experiments show that our AWB under the MMT framework outperforms the state-of-the-art methods by a large margin.

Broader Impact
Unsupervised domain adaptation (UDA) is regarded as an important way to improve person re-identification performance in unknown target domains. This is because, in practical applications, it is expensive and time-consuming to label data in an unknown target domain, while UDA algorithms can exploit unlabeled target data to improve pre-trained models. Practical applications include smart security, intelligent video surveillance, and so on. For example, this technology can help find lost relatives faster and reduce crime rates in cities. If maturely applied, it will free up a great deal of manpower, improving automation and cutting costs. However, it may also lead to the unemployment of some workers, such as security guards.
Acknowledgment
We thank the Informatization Office of Beihang University for providing the High Performance Computing Platform. This work is also supported by the Inception Institute of Artificial Intelligence and the School of Mathematical Sciences of Beihang University.
References

[1] Xiaobin Chang, Yongxin Yang, Tao Xiang, and Timothy M. Hospedales. Disjoint label space transfer learning with common factorised space. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3288–3295, 2019.
[2] Guangyi Chen, Chunze Lin, Liangliang Ren, Jiwen Lu, and Jie Zhou. Self-critical attention learning for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 9637–9646, 2019.
[3] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–250, 2018.
[4] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang. ABD-Net: Attentive but diverse person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 8351–8361, 2019.
[5] Zuozhuo Dai, Mingqiang Chen, Xiaodong Gu, Siyu Zhu, and Ping Tan. Batch DropBlock network for person re-identification and beyond. In Proceedings of the IEEE International Conference on Computer Vision, pages 3691–3701, 2019.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[7] Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 994–1003, 2018.
[8] Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting more attention to video salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8554–8564, 2019.
[9] Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(4):1–18, 2018.
[10] Yang Fu, Yunchao Wei, Guanshuo Wang, Yuqian Zhou, Honghui Shi, and Thomas S. Huang. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 6112–6121, 2019.
[11] Yixiao Ge, Dapeng Chen, and Hongsheng Li. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. In International Conference on Learning Representations, 2020.
[12] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10727–10737, 2018.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Devinder Kumar, Parthipan Siva, Paul Marchwica, and Alexander Wong. Unsupervised domain adaptation in person re-id via k-reciprocal clustering and large-scale heterogeneous environment synthesis. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020, pages 2634–2643. IEEE, 2020.
[16] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 152–159, 2014.
[17] Yu-Jhe Li, Ci-Siang Lin, Yan-Bo Lin, and Yu-Chiang Frank Wang. Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 7919–7929, 2019.
[18] Yu-Jhe Li, Fu-En Yang, Yen-Cheng Liu, Yu-Ying Yeh, Xiaofei Du, and Yu-Chiang Frank Wang. Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 172–178, 2018.
[19] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8738–8745, 2019.
[20] Hao Luo, Wei Jiang, Youzhi Gu, Fuxu Liu, Xingyu Liao, Shenqi Lai, and Jianyang Gu. A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia, 2019.
[21] Yuxin Peng, Xiangteng He, and Junjie Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, 2017.
[22] Lei Qi, Lei Wang, Jing Huo, Luping Zhou, Yinghuan Shi, and Yang Gao. A novel unsupervised camera-aware domain adaptation framework for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 8080–8089, 2019.
[23] Ruijie Quan, Xuanyi Dong, Yu Wu, Linchao Zhu, and Yi Yang. Auto-ReID: Searching for a part-aware ConvNet for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3750–3759, 2019.
[24] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[25] Liangchen Song, Cheng Wang, Lefei Zhang, Bo Du, Qian Zhang, Chang Huang, and Xinggang Wang. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognition, page 107173, 2020.
[26] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496, 2018.
[27] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In
Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition , pages 3156–3164, 2017.[28] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminativefeatures with multiple granularities for person re-identification. In
Proceedings of the 26th ACMinternational conference on Multimedia , pages 274–282, 2018.[29] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. Transferable joint attribute-identitydeep learning for unsupervised person re-identification. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , pages 2275–2284, 2018.[30] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks.In
Proceedings of the IEEE conference on computer vision and pattern recognition , pages7794–7803, 2018.[31] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gapfor person re-identification. In
Proceedings of the IEEE conference on computer vision andpattern recognition , pages 79–88, 2018.[32] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional blockattention module. In
Proceedings of the European Conference on Computer Vision (ECCV) ,pages 3–19, 2018.[33] Jinlin Wu, Shengcai Liao, Xiaobo Wang, Yang Yang, Stan Z Li, et al. Clustering and dynamicsampling based unsupervised domain adaptation for person re-identification. In , pages 886–891. IEEE, 2019.[34] Xiaofu Wu, Ben Xie, Shiliang Zhao, Suofei Zhang, Yong Xiao, and Ming Li. Diversity-achievingslow-dropblock network for person re-identification. arXiv preprint arXiv:2002.04414 , 2020.[35] Bryan Ning Xia, Yuan Gong, Yizhe Zhang, and Christian Poellabauer. Second-order non-local attention networks for person re-identification. In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 3760–3769, 2019.[36] Xuezhi Xiang, Zeting Yu, Ning Lv, Xiangdong Kong, and Abdulmotaleb El Saddik. Semi-supervised image classification via attention mechanism and generative adversarial network. In
Eleventh International Conference on Graphics and Image Processing (ICGIP 2019) , volume11373, page 113731J. International Society for Optics and Photonics, 2020.[37] Fengxiang Yang, Ke Li, Zhun Zhong, Zhiming Luo, Xing Sun, Hao Cheng, Xiaowei Guo,Feiyue Huang, Rongrong Ji, and Shaozi Li. Asymmetric co-teaching for unsupervised crossdomain person re-identification. arXiv preprint arXiv:1912.01349 , 2019.[38] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learningfor person re-identification: A survey and outlook. arXiv preprint arXiv:2001.04193 , 2020.1139] Xiaoning Zhang, Tiantian Wang, Jinqing Qi, Huchuan Lu, and Gang Wang. Progressiveattention guided recurrent network for salient object detection. In
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , pages 714–722, 2018.[40] Xinyu Zhang, Jiewei Cao, Chunhua Shen, and Mingyu You. Self-training with progressiveaugmentation for unsupervised cross-domain person re-identification. In
Proceedings of theIEEE International Conference on Computer Vision , pages 8222–8231, 2019.[41] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalableperson re-identification: A benchmark. In
Proceedings of the IEEE international conference oncomputer vision , pages 1116–1124, 2015.[42] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Jointdiscriminative and generative learning for person re-identification. In
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , pages 2138–2147, 2019.[43] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve theperson re-identification baseline in vitro. In
Proceedings of the IEEE International Conferenceon Computer Vision , pages 3754–3762, 2017.[44] Zhun Zhong, Liang Zheng, Shaozi Li, and Yi Yang. Generalizing a person retrieval modelhetero-and homogeneously. In
Proceedings of the European Conference on Computer Vision(ECCV) , pages 172–188, 2018.[45] Zhun Zhong, Liang Zheng, Zhiming Luo, Shaozi Li, and Yi Yang. Invariance matters: Exemplarmemory for domain adaptive person re-identification. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , pages 598–607, 2019.[46] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. Omni-scale feature learningfor person re-identification. In
Proceedings of the IEEE International Conference on ComputerVision , pages 3702–3712, 2019.[47] Sanping Zhou, Fei Wang, Zeyi Huang, and Jinjun Wang. Discriminative feature learning withconsistent attention regularization for person re-identification. In