Neurocomputing | 2021

Divide-and-Merge the embedding space for cross-modality person search


Abstract


This study considers the problem of text-based person search, which aims to find the person described by a given text query in an image gallery. Existing methods usually either learn a similarity mapping between local parts of the image and the text, or embed the whole image and text into a unified embedding space. However, the relevance between local parts and the global whole remains largely underexplored. In this paper, we design a Divide-and-Merge Embedding (DME) learning framework for text-based person search. DME explicitly 1) models the relations between local parts and the global embedding, and 2) incorporates local details into the global embedding. Specifically, we design a Feature Dividing Network (FDN) that embeds the input into K locally guided semantic representations via self-attentive embedding, where each representation depicts a local part of the person. We then propose a Relevance-based Subspace Projection (RSP) method that merges the diverse local representations into a compact global embedding. RSP helps the model obtain a discriminative embedding by jointly minimizing the redundancy among local parts and maximizing the relevance between local parts and the global embedding. Extensive experiments on three challenging benchmarks, i.e., the CUHK-PEDES, CUB, and Flowers datasets, demonstrate the effectiveness of the proposed method.
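To make the Divide-and-Merge idea concrete, the sketch below implements one plausible reading of the two stages in PyTorch: a structured self-attention module that divides a feature sequence into K part representations, and a merge step that forms a global embedding while penalizing redundancy among parts and rewarding part-global relevance, mirroring the two objectives the abstract attributes to RSP. The layer sizes, the module and function names (`FeatureDividingSketch`, `merge_parts`), and the averaging merge are our assumptions for illustration; the paper's exact FDN and RSP formulations may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureDividingSketch(nn.Module):
    """Structured self-attention that divides a feature sequence into K
    locally guided part representations. Layer sizes and names are
    illustrative assumptions, not the authors' exact FDN architecture."""

    def __init__(self, d_model: int, d_attn: int, num_parts: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_attn, bias=False)
        self.w2 = nn.Linear(d_attn, num_parts, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, n, d_model) -- n region/token features from a backbone
        a = F.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)  # (batch, n, K)
        return a.transpose(1, 2) @ h                           # (batch, K, d_model)


def merge_parts(parts: torch.Tensor):
    """Hypothetical merge step: average the K parts into a global embedding,
    penalize off-diagonal part-part similarity (redundancy) and reward
    part-global similarity (relevance). The paper's actual subspace
    projection may differ."""
    p = F.normalize(parts, dim=-1)                  # (batch, K, d)
    g = F.normalize(p.mean(dim=1), dim=-1)          # (batch, d) global embedding
    gram = p @ p.transpose(1, 2)                    # (batch, K, K)
    eye = torch.eye(p.size(1), device=p.device)
    redundancy = (gram - eye).pow(2).mean()         # minimize overlap among parts
    relevance = (p @ g.unsqueeze(-1)).mean()        # maximize part-global relevance
    return g, redundancy - relevance                # embedding and auxiliary loss


# Usage: divide 49 image-region features into K = 6 parts, then merge.
fdn = FeatureDividingSketch(d_model=512, d_attn=256, num_parts=6)
parts = fdn(torch.randn(2, 49, 512))                # (2, 6, 512)
global_emb, aux_loss = merge_parts(parts)           # (2, 512), scalar
```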

Volume 463
Pages 388-399
DOI 10.1016/j.neucom.2021.08.058
Language English
Journal Neurocomputing
