
Learning Cross-Lingual Mappings in Imperfectly Isomorphic Embedding Spaces


Abstract


A mainstream approach to cross-lingual word embeddings learns a linear mapping between two monolingual embedding spaces using a training dictionary. A successful linear mapping requires the embedding spaces to be isomorphic. However, monolingual embedding spaces are not perfectly isomorphic, so a linear mapping cannot align them accurately. In this study, we assume that two embedding spaces are composed of near-isomorphic translation pairs (NearITP) and non-isomorphic translation pairs. Because NearITP occupy similar substructures in the two spaces, they allow a linear mapping to work well. Motivated by this, we design a screening strategy to identify NearITP effectively. With this strategy, we find that the proportion of NearITP in the commonly used training dictionary is relatively low, leading to sub-optimal results. To address this problem, we propose a general framework that can be combined with any mapping method and further boosts the subsequent mapping. Experimental results demonstrate that our framework improves over existing mapping-based methods and outperforms state-of-the-art models on two public datasets. Moreover, we show that our framework generalizes to contextual word embeddings such as multilingual BERT (mBERT), further enhancing the cross-lingual properties of mBERT.
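For readers unfamiliar with the mapping-based setting the abstract builds on, the sketch below shows the standard orthogonal Procrustes solution commonly used to learn a linear mapping from a seed dictionary. This is a minimal illustration of the baseline technique, not the paper's framework or screening strategy; the function name procrustes_mapping and the toy data are our own, and real embedding spaces are only approximately alignable.

import numpy as np

def procrustes_mapping(X, Y):
    # Closed-form solution of min_W ||XW - Y||_F over orthogonal W:
    # take the SVD of X^T Y = U S V^T and return W = U V^T
    # (the orthogonal Procrustes solution).
    # X, Y: (n, d) arrays of source/target embeddings whose rows are
    # aligned translation pairs from the training dictionary.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # (d, d) orthogonal mapping

# Toy usage with a perfectly isomorphic target space; in practice the
# spaces are only near-isomorphic, so recovery is approximate.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))           # hypothetical source embeddings
W_true, _ = np.linalg.qr(rng.normal(size=(300, 300)))
Y = X @ W_true                             # synthetic isomorphic target
W = procrustes_mapping(X, Y)
print(np.allclose(X @ W, Y))               # True: exact recovery here

The orthogonality constraint preserves distances and angles within the source space, which is exactly why the method presupposes (near-)isomorphism between the two embedding spaces.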

Volume 29
Pages 2630-2642
DOI 10.1109/TASLP.2021.3097935
Language English
Journal IEEE/ACM Transactions on Audio, Speech, and Language Processing
