Cross-View Image Retrieval -- Ground to Aerial Image Retrieval through Deep Learning
Numan Khurshid, Talha Hanif, Mohbat Tharani, and Murtaza Taj

Computer Vision and Graphics Lab, School of Science and Engineering, Lahore University of Management Sciences, Pakistan
{, 16060073, murtaza.taj}@lums.edu.pk, [email protected]

Abstract.
Cross-modal retrieval aims to measure the content similarity between different types of data. The idea has previously been applied to visual, text, and speech data. In this paper, we present a novel cross-modal retrieval method specifically for multi-view images, called Cross-view Image Retrieval (CVIR). Our approach aims to find a feature space as well as an embedding space in which samples from street-view images can be compared directly to satellite-view images (and vice versa). For this comparison, a novel deep metric learning based solution, "DeepCVIR", is proposed. Previous cross-view image datasets are deficient in that they (1) lack class information; (2) were originally collected for the cross-view image geolocalization task with coupled images; (3) do not include any images from off-street locations. To train, compare, and evaluate the performance of cross-view image retrieval, we present a new 6-class cross-view image dataset, termed CrossViewRet, which comprises 700 high-resolution dual-view images for each of the classes freeway, mountain, palace, river, ship, and stadium. Results show that the proposed DeepCVIR outperforms conventional matching approaches on the CVIR task for the given dataset and will also serve as a baseline for future research.
Keywords: Cross-modal Retrieval · Cross-View Image Retrieval · Cross-View Image Matching · Deep Metric Learning.
1 Introduction

Cross-view image matching (CVIM) has attracted considerable attention from researchers due to its growing applications in image geolocalization, GIS mapping, autonomous driving, augmented-reality navigation, and robot rescue [5,1]. Another key factor is the rapid increase in high-resolution satellite and street-view imagery provided by platforms such as Google and Flickr. One of the most challenging aspects of CVIM is to devise an effective method to bridge the heterogeneity gap between the two types of images [14,18]. We introduce cross-view image retrieval (CVIR), a special type of cross-modal retrieval that aims to enable flexible search and collection across dual-view images.
Fig. 1. Sample images from the developed CrossViewRet dataset, presenting its 6 distinct classes: freeway, mountain, palace, river, ship, and stadium. Apart from view-point variations, these images also exhibit seasonal changes, varying illumination, and different spatial resolutions.

For a query image taken from one view-point (say ground-view), CVIR searches the database for all similar images taken from the other view-point (say aerial-view). The idea has evolved from the notion of cross-view image matching, with one key difference: in standard cross-view image matching, a ground-view image is matched to its respective aerial-view image while relying only on the content of the images. We, in contrast, introduce CVIR, in which the system searches a database for all images similar to the given query image, considering contextual class information embedded in the visual descriptors of the images.

Common practice for a conventional retrieval system is representation learning, which transforms images into a feature space where the distance between them can be measured directly [3]. In our case, however, these representative features must be transformed into another, common embedding space to bridge the heterogeneity gap and compute similarity between them.

In this paper, we present a novel cross-view image retrieval method, termed Deep Metric Learning based cross-view image retrieval (DeepCVIR). This method aims to retain the discrimination among visual features from different semantic groups while also reducing the dual-view image disparities. To achieve this objective, class information is retained in the learned feature space and pairwise label information is retained in the embedding space for all the images. This is done by minimizing the discrimination loss of the images in both the feature space and the embedding space, ensuring that the learned embeddings are both discriminative in class information and view-invariant in nature. Figure 2 illustrates our proposed framework in detail.

The remainder of this paper is organized as follows: Section 2 reviews the related work in cross-view image matching and cross-modal learning. Section 3 presents the proposed model, including problem formulation, DeepCVIR, and implementation details. Section 4 explains the experimental setup, including the dataset, while Section 5 provides the results and analysis. Section 6 concludes the paper.
2 Related Work

Recent applications of cross-modal retrieval, especially for text, speech, and images in big data, have opened new avenues that require improved solutions. Existing techniques apply cross-modal retrieval to multi-modal data but do not address variety of data within a single modality, such as multi-view image retrieval [15].

Cross-view image matching can be taken as one such potential problem, for which Vo et al. cross-matched and geo-localized street-view images of 11 cities of the United States to their respective satellite-view images [13], experimenting with various versions of Siamese and Triplet networks for feature extraction with a distance-based logistic loss. Validating the approach on another similar dataset, CV-USA, Hu et al. combined local features and global descriptors [5]. One of the major shortcomings of both these datasets is that the street-view images are obtained from the Google satellite image repository, which completely ignores off-street images. Another way to cross-match images is to detect and match the content of the images, e.g., matching buildings in the street-view query image to buildings in the aerial images [12]. This particular approach intuitively fails in areas lacking tall structures or buildings with prominent features. Researchers have even tried to predict the ground-level scene layout from the respective aerial images; however, that approach could not be extended for accurate image matching and retrieval [17].

Image retrieval, on the other hand, has already been progressively used for multi-modal matching in the field of information retrieval [14]. The approach has been validated for applications matching sentences to images, ranking images, and free-hand sketch based image retrieval [15,6,8,7,18]. Moreover, metric learning networks have previously been introduced for template matching tasks [16]. We introduce cross-view image retrieval, employing the combination of metric learning and image retrieval techniques for class-based cross-view image matching.
3 Proposed Approach

One of the core ideas of this paper is to identify an efficient framework for CVIR using the contextual information of the scene in an image. The detailed approach is presented in four subsections: a) Problem Formulation, b) Deep Feature Extraction, c) Feature Matching, and d) DeepCVIR. Figure 2 visually explains the overall architecture of the proposed approach.
3.1 Problem Formulation

We focus on the formulation of the CVIR problem for the CrossViewRet dataset $D$ without losing generality. Dataset $D$ contains two subsets: ground-view images $D_g$ and aerial-view images $D_a$.
Fig. 2. Overall process of cross-view image retrieval: a) the indexing step identifies the features of the query image and of the images to be matched (from the database); b) the retrieval step matches image features and visualizes the top k relevant images based upon the retrieval score.

In ground-to-aerial retrieval, for a given query image $\mathbf{x}^g \in D_g$ we aim to retrieve the set of all relevant images $D_{rel}$, where $D_{rel} \subset D_a$. Similarly, the problem can be formulated for aerial-to-ground search and retrieval by replacing the query image with $\mathbf{x}^a$ and the search data with $D_g$. For this purpose, we assume a collection of $n$ instances of ground-view and aerial-view image pairs, denoted as $\Psi = \{(\mathbf{x}_i^a, \mathbf{x}_i^g)\}_{i=1}^{n}$, where $\mathbf{x}_i^g$ is the input ground-view image sample and $\mathbf{x}_i^a$ is the input aerial-view image sample for the $i$-th instance. Each pair of instances $(\mathbf{x}_i^a, \mathbf{x}_i^g)$ is assigned a semantic label $y_{ji}$, with $y_{ji} = 1$ if the $i$-th instance belongs to the $j$-th class and $y_{ji} = 0$ otherwise.
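To make the pairing concrete, the following is a minimal sketch (not from the paper; `make_pairs` and its arguments are hypothetical) of how matched and unmatched cross-view pairs with binary labels could be assembled:

```python
# Hypothetical sketch: build (ground, aerial, label) training pairs.
# label = 1 for a matched (same-instance) pair, 0 for a cross-class pair.
import numpy as np

def make_pairs(ground_imgs, aerial_imgs, class_ids, seed=0):
    rng = np.random.default_rng(seed)
    xg, xa, y = [], [], []
    for i in range(len(ground_imgs)):
        # positive pair: the coupled aerial view of the same instance
        xg.append(ground_imgs[i]); xa.append(aerial_imgs[i]); y.append(1)
        # negative pair: a random aerial image from a different class
        j = int(rng.integers(len(aerial_imgs)))
        while class_ids[j] == class_ids[i]:
            j = int(rng.integers(len(aerial_imgs)))
        xg.append(ground_imgs[i]); xa.append(aerial_imgs[j]); y.append(0)
    return np.stack(xg), np.stack(xa), np.array(y)
```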
3.2 Deep Feature Extraction

Representation learning, also termed "Indexing" in CVIR, refers to learning two functions for dual-view images containing the same class information: $\mathbf{u}_i = f(\mathbf{x}_i^g; \phi_g) \in \mathbb{R}^d$ for the ground-view image and $\mathbf{v}_i = f(\mathbf{x}_i^a; \phi_a) \in \mathbb{R}^d$ for the aerial-view image, where $d$ is the dimensionality of the features in their respective feature spaces, and $\phi_g$ and $\phi_a$ are the trainable weights of the street-view and satellite-view feature learning networks. The feature extraction step for the cross-view image pair builds on benchmark deep supervised convolutional neural networks, namely pretrained VGG, ResNet-50, and Tiny-Inception-ResNet-v2 [10]. These networks are selected due to their exceptional performance in object recognition and classification tasks. Unlike a traditional Siamese network, two separate feature learning networks (without weight sharing) are employed here for extracting the features of street- and satellite-view images. Features acquired through this technique implicitly retain the class information of the images irrespective of their visual viewpoint. Although these representations may not be projected into a combined feature space for both views, they share the same dimensional footprint and can be compared in an embedding space through matching. Figure 2 (left side) shows the overall indexing procedure in detail.

3.3 Feature Matching

Features of a cross-view image pair are matched through distance computation, metric learning, or deep networks with specialized loss functions. Traditionally, matching techniques employ distance computation on the paired data $(\mathbf{u}_i, \mathbf{v}_j)$. For instance, the Euclidean distance between the feature embeddings of this paired data is computed as

$$D(\Psi) = \lVert \mathbf{u}_i - \mathbf{v}_j \rVert_2 \qquad (1)$$

where $\lVert \cdot \rVert_2$ denotes the L2-norm. In distance metric learning, especially contrastive embedding, a loss function implemented on top of the point-wise distance operation is minimized to learn the association of similar and dissimilar data pairs. It is computed as

$$J_{con} = \sum_{i,j} \ell_{ij}\, D(\Psi) + (1 - \ell_{ij})\, h\big(\alpha - D(\Psi)\big) \qquad (2)$$

where $\ell_{ij} \in \{0, 1\}$, $h(x) = \max(0, x)$ is the hinge function, and $D(\Psi)$ is taken from (1). The margin $\alpha$ penalizes dissimilar pairs whose distance falls below this predefined value via the hinge term in the second part of (2). Similarly, the Mahalanobis distance between the cross-view image pair features is computed as

$$J_{ma} = \sqrt{(\mathbf{u}_i - \mathbf{v}_j)^{\top} C^{-1} (\mathbf{u}_i - \mathbf{v}_j)} \qquad (3)$$

where $\mathbf{u}_i$ and $\mathbf{v}_j$ are two points from the same distribution with covariance matrix $C$. The Mahalanobis distance reduces to the Euclidean distance when the covariance matrix is the identity matrix.

For each of these matching measures, if the retrieval score is less than the given threshold (say 0.5), the feature pair is categorized as similar, and as dissimilar otherwise. For image retrieval, the top k images are visualized as relevant to the query image, as shown in Figure 2.
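For illustration, a minimal numpy sketch of the three matching measures in Eqs. (1)-(3) follows; the covariance matrix C is assumed to be estimated from the training features:

```python
# Sketch of the matching measures; u, v are d-dimensional feature vectors.
import numpy as np

def euclidean(u, v):
    # Eq. (1): L2 distance between the paired embeddings
    return float(np.linalg.norm(u - v))

def contrastive(u, v, same, alpha=1.0):
    # Eq. (2), for a single pair: same = 1 (similar) or 0 (dissimilar)
    d = euclidean(u, v)
    return same * d + (1 - same) * max(0.0, alpha - d)

def mahalanobis(u, v, C):
    # Eq. (3): reduces to the Euclidean distance when C is the identity
    diff = u - v
    return float(np.sqrt(diff @ np.linalg.inv(C) @ diff))
```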
3.4 DeepCVIR

The idea of transforming images from feature space to embedding space can be realized with a deep learning model, technically called a deep metric learning (DML) network [16,11]. In this research we propose a residual deep metric learning architecture optimized with the well-known binary cross-entropy loss.
Reshaping 1D features in DeepCVIR. To exploit the contextual information of the objects in the image features, we reshape the 1D features (1024 × 1) from the indexing step into 2D features (32 × 32) in the retrieval step. 2D convolution layers are then employed to extract significant information from the concatenated 2D features of the matching images.
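A minimal sketch of this reshaping step (shapes as stated above; the channel-wise concatenation is our reading of Fig. 3, not an explicitly stated detail):

```python
# Reshape the two 1024-d feature vectors into 32x32 maps and stack them
# as channels, giving a (32, 32, 2) input for the 2D convolution layers.
import numpy as np

u = np.random.rand(1024)   # street-view feature (placeholder)
v = np.random.rand(1024)   # satellite-view feature (placeholder)
pair = np.concatenate([u.reshape(32, 32, 1),
                       v.reshape(32, 32, 1)], axis=-1)
print(pair.shape)          # (32, 32, 2)
```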
Residual Blocks in DeepCVIR. This DML network, inspired by residual learning, comprises a combination of the two standard residual units presented in [4]. The first residual unit consists of two convolution layers with an identity path, while the second places a 1 × 1 convolution on the shortcut path (a Keras sketch is given below).

Cross-view image retrieval can be inherently divided into two sub-tasks, namely street-to-satellite retrieval and satellite-to-street retrieval. If, for a given street-view query image, relevant satellite-view images are retrieved, the task is referred to as Str2Sat; the vice-versa case is referred to as Sat2Str in the rest of the paper. We also investigate the effects of employing different activation functions in the DML networks.
4 Experimental Setup

In this research, a new dataset, CrossViewRet, has been developed to evaluate and compare the performance of the DeepCVIR framework. Previous cross-view image datasets are deficient in that they (1) lack class information about the content of the image; (2) were originally collected for the cross-view image geolocalization task with coupled images; (3) were specifically acquired for the purpose of autonomous vehicles and therefore do not include any images from off-street locations.
Fig. 3.
The proposed deep metric learning network (S-DeepCVIR) employed in the DeepCVIR technique consists of only one residual block (two residual units). For D-DeepCVIR and T-DeepCVIR, two and three stacked residual blocks, respectively, are additionally used in this network. The rest of the network structure remains the same for all three variations.

CrossViewRet comprises images of 6 classes (freeway, mountain, palace, river, ship, and stadium), with 700 high-resolution dual-view images for each class. The satellite-view images are collected from the benchmark NWPU-RESISC45 dataset, while the respective street-view images of each class are downloaded from Flickr using the Flickr API [2]. The downloaded street-view images are cross-checked by human annotators, and images with obvious visual descriptions of the classes are selected for each class. The spatial resolution of the satellite-view images is 256 × 256, while the street-view images are of variable sizes; they have been resized before being used for experimentation. The dataset has been made public for future use (https://cvlab.lums.edu.pk/category/projects/imageretrieval).

CrossViewRet is a very complex dataset. Unlike existing cross-view datasets [13,5], which contain ground and aerial images of the same location, we do not constrain the images to be of the same geo-location. Rather, we focus on the visual contents of the scene regardless of any transformations in the images, weather conditions, and variation between day and night time. As shown in Fig. 1, the ground view in the sample image of the mountain class contains snow whereas its target aerial-view image does not. Similarly, the ground-view images of the palace and stadium classes are taken during the night while the aerial views contain daytime drone images. The aerial view in the river example, however, is a satellite image, which is quite different from top-view drone images.

Implementation Details. We use two independent networks for feature learning and embedding learning. For feature learning, VGGNet, ResNet, and Inception-ResNet-v2 with pretrained ImageNet weights are fine-tuned. Two independent sub-networks are employed for learning the discriminating class-wise features of the two views. The architecture of the proposed DML network is explained in Section 3.4. The standard 80/20 train-validation split of the CVIR dataset was used to fine-tune the feature networks and to train the variants of the DML networks, respectively. Query images used for evaluation were randomly taken from the validation split of the data.

The deep learning networks were trained on an Nvidia RTX 2080Ti GPU in Keras. For training the feature networks, we employ Stochastic Gradient Descent (SGD) with an initial learning rate of 0.00001 and a learning rate decay with patience 5. For the DML network, ADAM with an initial learning rate of 0.001 is used. An early stopping criterion of 15 epochs is used to halt training for all the networks.

Evaluation Metrics. We evaluate the performance of cross-view image retrieval not only with the standard measures of Precision, Recall, and F1-Score but also with Average Normalized Modified Retrieval Rank (ANMRR), Mean Average Precision (mAP), and P@K (read as Precision at K) [9]. P@K is the percentage of queries for which the ground-truth image class appears in one of the first K retrieved results. Here we only employ the P@5 measure for our analysis.
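As an illustration of the P@K measure just described, here is a small sketch (a hypothetical helper, not the paper's evaluation code):

```python
# P@K: fraction of queries whose top-K retrieved images contain at least
# one image of the query's ground-truth class.
import numpy as np

def precision_at_k(ranked_class_ids, query_class_ids, k=5):
    # ranked_class_ids: (n_queries, n_db) class ids sorted by similarity
    # query_class_ids:  (n_queries,) ground-truth class per query
    hits = [int(q) in list(ranked[:k])
            for ranked, q in zip(ranked_class_ids, query_class_ids)]
    return float(np.mean(hits))
```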
Table 1. Performance comparison of features computed with state-of-the-art architectures (IncepRes-v2 = Tiny-Inception-ResNet-v2).

| Feature Network | Similarity Measure | ANMRR ↓ | mAP ↑ | P@5 ↑ | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|---|
| ResNet-50 | Euclidean | 0.42 | 0.17 | 0.16 | 0.50 | 0.50 | 0.50 |
| IncepRes-v2 | Euclidean | 0.42 | 0.17 | 0.16 | 0.50 | 0.50 | 0.50 |
| VGG-16 | Euclidean | 0.41 | 0.18 | 0.15 | 0.48 | 0.48 | 0.48 |
| ResNet-50 | Contrastive | 0.05 | 0.90 | 0.88 | 0.50 | 0.50 | 0.50 |
| IncepRes-v2 | Contrastive | 0.40 | 0.20 | 0.16 | 0.50 | 0.50 | 0.50 |
| VGG-16 | Contrastive | 0.29 | 0.41 | 0.40 | 0.50 | 0.50 | 0.50 |
| ResNet-50 | Mahalanobis | 0.42 | 0.17 | 0.16 | 0.50 | 0.50 | 0.50 |
| IncepRes-v2 | Mahalanobis | 0.42 | 0.16 | 0.15 | 0.50 | 0.50 | 0.50 |
| VGG-16 | Mahalanobis | 0.42 | 0.17 | 0.19 | 0.50 | 0.50 | 0.50 |
| ResNet-50 | DeepCVIR-DML | 0.03 | 0.93 | 0.94 | 0.94 | 0.94 | 0.94 |
| IncepRes-v2 | DeepCVIR-DML | 0.29 | 0.22 | 0.23 | 0.52 | 0.52 | 0.52 |
| VGG-16 | DeepCVIR-DML | – | – | – | – | – | – |
5 Results and Analysis

Validation of the proposed DeepCVIR approach on such a challenging dataset demands extensive assessment. We therefore provide a comparative analysis of the approach using various state-of-the-art techniques as well as variants of the proposed method.
[Fig. 4 panels: a) Features vs. Similarity Metric; b) Convergence rate of DeepCVIR]
Fig. 4.
Additional results: a) performance of similarity-measuring techniques with various deep supervised features, and b) convergence rate of the {S, D, T}-DeepCVIR networks during training and validation.

Various deep features have previously been used for the task of same-view image retrieval; however, view-invariant features of multi-modal images play a pivotal role in CVIR. Table 1 shows that although Inception-ResNet-v2 may outperform VGGNet and ResNet on the ImageNet challenge, it fails to extract the most suitable features for cross-view image matching. In addition, the performance of the various distance computation methods illustrates that the problem is more complex and cannot be solved by linear distances, i.e., Euclidean distance or contrastive loss embedding. Figure 4(a) also confirms the improvement in learning behavior in terms of percentage validation accuracy.
Table 2. Comparison of different variations of the proposed architecture.

| DML Type | Feature Type | Activation Function | ANMRR ↓ | mAP ↑ | P@5 ↑ | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|---|---|
| S-DeepCVIR | ResNet-50 | eLU | 0.04 | 0.93 | 0.94 | 0.50 | 0.50 | 0.50 |
| S-DeepCVIR | ResNet-50 | Leaky ReLU | 0.03 | 0.93 | 0.94 | 0.94 | 0.94 | 0.94 |
| S-DeepCVIR | ResNet-50 | ReLU | 0.03 | 0.93 | 0.95 | 0.93 | 0.93 | 0.93 |
| D-DeepCVIR | ResNet-50 | ReLU | 0.04 | 0.92 | 0.94 | 0.92 | 0.92 | 0.92 |
| T-DeepCVIR | ResNet-50 | ReLU | 0.05 | 0.89 | 0.92 | 0.90 | 0.90 | 0.90 |
| S-DeepCVIR | VGG-16 | Leaky ReLU | – | – | – | – | – | – |
| D-DeepCVIR | VGG-16 | Leaky ReLU | 0.02 | 0.95 | 0.97 | 0.95 | 0.95 | 0.95 |
| T-DeepCVIR | VGG-16 | Leaky ReLU | 0.02 | 0.96 | 0.98 | 0.95 | 0.95 | 0.95 |
The proposed DeepCVIR architecture involves the contribution of the DML network, which assists the learning process by efficiently learning an embedding space that discriminates similar and dissimilar pairs. To evaluate the learning behavior of the DML network, we tried variants of the DML with single, double, and triple combinations of the proposed residual blocks, termed S-DeepCVIR, D-DeepCVIR, and T-DeepCVIR, respectively.

Impact of the Number of Residual Blocks in DeepCVIR. In general, increasing the number of residual blocks in a network improves overall performance; in our case, however, S-DeepCVIR, with the fewest residual blocks, outperforms the rest of the DeepCVIR networks. This was beyond our anticipation, but one cannot neglect the simplicity of this task compared to other recognition tasks. It can be concluded that the number of learnable parameters of S-DeepCVIR is sufficient to separate similar and dissimilar features. The ANMRR and mAP values in Table 2 illustrate that although all variants of DeepCVIR performed better than the other matching techniques, S-DeepCVIR performed extraordinarily well on the given task, for Str2Sat as well as Sat2Str. Its convergence curves, illustrated in Figure 4(b), show significantly earlier convergence and much lower loss with respect to the number of epochs, owing to its smaller number of learnable parameters, compared to the rest of the combinations.
Our proposed S-DeepCVIR framework performs equally well on both the Str2Sat and Sat2Str tasks. The results in Table 3 show that although the average ANMRR values remain comparable for all variants of the DeepCVIR architecture, S-DeepCVIR with VGG features achieves the lowest average ANMRR.
Table 3. Performance comparison of VGG-16 features with {S, D, T}-DeepCVIR networks on the street-to-satellite (Str2Sat) and satellite-to-street (Sat2Str) retrieval tasks.

| Features | DeepCVIR | Str2Sat ANMRR ↓ | Str2Sat mAP ↑ | Str2Sat P@5 ↑ | Sat2Str ANMRR ↓ | Sat2Str mAP ↑ | Sat2Str P@5 ↑ | Average ANMRR |
|---|---|---|---|---|---|---|---|---|
| VGG-16 | S-DeepCVIR | – | – | – | – | – | – | – |
[Fig. 5 panels: a) ResNet-50, b) ReLU, c) Leaky ReLU, d) eLU (top row); e) VGG-16, f) S-DeepCVIR, g) D-DeepCVIR, h) T-DeepCVIR (bottom row)]
Fig. 5. t-SNE plots showing the learning behavior of the DML for: ResNet-50 features trained with S-DeepCVIR employing various activation functions (top row), and VGG-16 features using different numbers of residual blocks in the DML network (bottom row).

The t-SNE plot is a very effective tool for visualizing data in a two-dimensional plane for better analysis. We adopted this approach to observe and validate the contribution of the DML in transforming features to the embedding space. Figure 5(a,f) shows that the raw image features are distributed over the whole region of the plot, making it very difficult to measure the correspondence between same-class and different-class features using only a linear distance; the DML separates them into distinguishable clusters. Although no class information was explicitly provided to the network during training, it successfully clustered the similar pairs into six different classes. It is also observed from the figures that the use of different activation functions and multiple residual blocks does not contribute to an improvement of the overall result.
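For reference, a minimal sketch of such a t-SNE visualization using scikit-learn (the embeddings and labels here are random placeholders standing in for the DML outputs):

```python
# Project DML embeddings to 2D with t-SNE and color by pair label.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.rand(500, 64)          # placeholder embeddings
pair_labels = np.random.randint(0, 2, 500)    # 0 = unmatched, 1 = matched

pts = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(pts[:, 0], pts[:, 1], c=pair_labels, s=5)
plt.show()
```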
6 Conclusion

We propose a cross-view image retrieval system, for which we developed a cross-view dataset named CrossViewRet. The dataset consists of street-view and satellite-view images for 6 distinct classes, with 700 images per class. The proposed DeepCVIR system consists of two parts: a) a fine-tuned deep feature network, and b) a deep metric learning network trained on image pairs from the CrossViewRet dataset. Given the features of two images, the proposed residual DML network decides whether the two images belong to the same class. In addition, an ablation study and a detailed empirical analysis of different activation functions and numbers of residual blocks in the DML network have been performed. These show that our proposed DeepCVIR network performs significantly well on the problem of cross-view retrieval.
References
1. Arth, C., Schmalstieg, D.: Challenges of large-scale augmented reality on smartphones. Graz University of Technology, Graz, pp. 1–4 (2011)
2. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
3. Göksu, Ö., Aptoula, E.: Content based image retrieval of remote sensing images based on deep features. In: 26th Signal Processing and Communications Applications Conference (SIU). pp. 1–4 (2018)
4. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. pp. 630–645. Springer (2016)
5. Hu, S., Feng, M., Nguyen, R.M., Hee Lee, G.: CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7258–7267 (2018)
6. Lin, Z., Ding, G., Hu, M., Wang, J.: Semantics-preserving hashing for cross-view retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3864–3872 (2015)
7. Liu, L., Shen, F., Shen, Y., Liu, X., Shao, L.: Deep sketch hashing: Fast free-hand sketch-based image retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2862–2871 (2017)
8. Ma, L., Lu, Z., Shang, L., Li, H.: Multimodal convolutional neural networks for matching image and sentence. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2623–2631 (2015)
9. Napoletano, P.: Visual descriptors for content-based retrieval of remote-sensing images. International Journal of Remote Sensing (5), 1–34 (2018)
10. Nazir, U., Khurshid, N., Bhimra, M.A., Taj, M.: Tiny-Inception-ResNet-v2: Using deep learning for eliminating bonded labors of brick kilns in South Asia. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 39–43 (2019)
11. Tharani, M., Khurshid, N., Taj, M.: Unsupervised deep features for remote sensing image matching via discriminator network. arXiv preprint arXiv:1810.06470 (2018)
12. Tian, Y., Chen, C., Shah, M.: Cross-view image matching for geo-localization in urban environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3608–3616 (2017)
13. Vo, N.N., Hays, J.: Localizing and orienting street views using overhead imagery. In: European Conference on Computer Vision. pp. 494–509. Springer (2016)
14. Wang, K., Yin, Q., Wang, W., Wu, S., Wang, L.: A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215 (2016)
15. Wei, Y., Zhao, Y., Lu, C., Wei, S., Liu, L., Zhu, Z., Yan, S.: Cross-modal retrieval with CNN visual features: A new baseline. IEEE Transactions on Cybernetics 47