Cross-view image synthesis using geometry-guided conditional GANs
Krishna Regmi ∗∗ Center for Research in Computer Vision (CRCV), University of Central Florida, Orlando, FL, USA
Ali Borji
Markable.AI, New York, USA

∗∗Corresponding author. E-mail: [email protected] (Krishna Regmi), [email protected] (Ali Borji). This work extends our CVPR 2018 paper [Regmi and Borji (2018)].

ABSTRACT
We address the problem of generating images across two drastically different views, namely ground (street) and aerial (overhead) views. Image synthesis by itself is a very challenging computer vision task and is even more so when generation is conditioned on an image in another view. Due to the difference in viewpoints, there is only a small overlapping field of view and little common content between these two views. Here, we try to preserve the pixel information between the views so that the generated image is a realistic representation of the cross-view input image. For this, we propose to use homography as a guide to map the images between the views based on the common field of view, preserving the details in the input image. We then use generative adversarial networks to inpaint the missing regions in the transformed image and add realism to it. Our exhaustive evaluation and model comparison demonstrate that utilizing geometry constraints adds fine details to the generated images and can be a better approach for cross-view image synthesis than purely pixel-based synthesis methods.
1. Introduction
Novel view synthesis is a long-standing problem in computer vision. Earlier works in this area synthesize views of single objects or natural scenes with small variations in viewing angle. For generating views of single objects in a uniform background or of scenes, the task amounts to learning a mapping or transformation across views. With a small camera movement, there is a high degree of overlap between the fields of view, resulting in images with high content overlap. Generative models can then learn to copy large parts of the image content from the input to the output and perform the synthesis task satisfactorily. Even so, view synthesis remains very challenging due to the presence of multiple objects in the scene: the network needs to learn the object relations and occlusions in the scene.

Generating cross-view natural scenes conditioned on images from drastically different views (e.g., generating the top view from a street-view scene) is much harder. This is mainly because there is very little overlap between the corresponding fields of view. Thus, simply copying and pasting pixels from one view to another is not a solution. Rather, the model needs to learn the object classes present in the input view and to understand the correspondences in the target view with appropriate object relations and transformations (i.e., geometric reasoning).

In this work, we address the problem of synthesizing ground-level images from overhead imagery and vice versa using conditional Generative Adversarial Networks [Mirza and Osindero (2014)]. Also, when possible, we guide the generative networks by feeding homography-transformed images as inputs to improve the synthesized results. Conditional GANs generate new images from conditioning variables provided as input; the conditioning variables can be other images, text descriptions, class labels, etc.

Our first approach exploits the success of the first GAN-based image-to-image translation network, put forward by Isola et al. (2017) as a general-purpose architecture for multiple image translation tasks. That work translates images of objects or scenes represented as RGB images, gradient fields, edge maps, aerial images, sketches, etc. across these representations. Thus, it essentially operates on different representations of images in a single view. We use this architecture as a starting point (base model) for our task and obtain encouraging results. The limitation of this approach for our problem, however, is that the images to be transformed come from two drastically different views, have small overlap between their fields of view, and objects in the images might be occluded. As a result, learning to map the pixels between the views is difficult, as corresponding pixels in the two views may represent different object classes. To address this challenge, we propose to use the semantic segmentation maps of target-view images to regularize the training process. This helps the network learn the semantic class of the pixels in the target view and guides the network to generate the target pixels.
By encouraging the networks to generate the segmentation maps in the target view, the network learns the semantic class of each pixel, which is an important cue for generating cross-view images that preserve semantic information from the source view to the target view.

The next approach we take to solve the cross-view image synthesis task is to exploit the geometric relation between the views to guide the synthesis. For this, we first compute the homography transformation matrix between the views and then project the aerial images to the street-view perspective. By doing so, we obtain an intermediate image that looks very close to the target-view image, but not as realistic and with some missing regions. Our problem then reduces to preserving the scene layout and details while filling in the missing regions and adding realism to the transformed image. For this, we use the cGAN architectures described in the previous approach. We also use different cGANs that work specifically on the inpainting and realism tasks to preserve the pixel information from the homography-transformed image in a controlled manner.

To summarize, we propose the following methods. We start with the simple image-to-image translation network of Isola et al. (2017) as a baseline (here called Pix2pix). We then propose two new cGAN architectures that generate images as well as segmentation maps in the target view. Augmenting the architectures with semantic segmentation generation helps improve the quality of the generated images. The first architecture, called
X-Fork, is a slight modification of the baseline, forking at the penultimate block to generate two outputs: the target-view image and the segmentation map. The second architecture, called
X-Seq, has a sequence of two baseline networks connected. The target-view image generated by the first network is fed to the second to generate its corresponding segmentation map. Once trained, both architectures generate better images than the baseline, which generates only the target-view images. This implies that learning to generate the segmentation map along with the image indeed improves the quality of the generated images. We also use homography to transform the aerial image to the ground view and feed the transformed image to these networks to further improve the results. Finally, we propose a method to preserve details from the homography-transformed image in a controlled setting to generate the street-view images. It consists of two subtasks: a) generating missing regions by inpainting, and b) adding realism using a GAN that preserves details visible in the aerial view in the street-view image. We call this approach
H-Regions. Throughout the paper, H in a method's name indicates that the input is the homography-transformed image.
2. Related Work
Zhai et al. (2017) explored the relationship between cross-view images by learning to predict the semantic layout of the ground image from its corresponding aerial image. They used the predicted layout to synthesize a ground-level panorama. Prior works relating aerial and ground imagery have addressed problems such as cross-view co-localization [Lin et al. (2013); Vo and Hays (2016)], ground-to-aerial geo-localization [Lin et al. (2015); MH and Lee (2018)] and geo-tagging cross-view images [Workman et al. (2015)]. Recently, the images generated by our cross-view image synthesis approach have been successfully used to bridge the domain gap between aerial and street-view images in geo-localization tasks [Regmi and Shah (2019)].

Cross-view relations have also been studied between egocentric (first-person) and exocentric (surveillance or third-person) domains for different purposes. Human re-identification by matching viewers in top-view and egocentric cameras has been tackled by establishing correspondences between the views in Ardeshir and Borji (2016). Soran et al. (2014) utilize the information from one egocentric camera and multiple exocentric cameras to solve the action recognition task.

Existing works on viewpoint transformation have synthesized novel views of the same objects [Dosovitskiy et al. (2017); Tatarchenko et al. (2016); Zhou et al. (2016)]. Zhou et al. (2016) proposed models that learn to copy pixel information from the input view and utilize it to preserve the identity and structure of the objects when generating new views. Tatarchenko et al. (2016) trained an encoder-decoder network to obtain 3D models of cars and chairs, which they later used to generate different views of an unseen car or chair. Dosovitskiy et al. (2017) learned generative models by training on 3D renderings of cars, chairs and tables, and synthesized intermediate views and objects by interpolating between views and models.

Goodfellow et al. (2014) pioneered Generative Adversarial Networks, which are very successful at generating sharp, unblurred images, much better than existing methods such as Restricted Boltzmann Machines [Hinton et al. (2006); Smolensky (1986)] or deep Boltzmann Machines [Salakhutdinov and Hinton (2009)].

Conditional GANs synthesize images conditioned on different parameters during both training and testing. Examples include conditioning on MNIST labels to generate digits by Mirza and Osindero (2014), on image representations to translate an image between different representations by Isola et al. (2017), and generating panoramic ground-level scenes from aerial images of the same location by Zhai et al. (2017). Reed et al. (2016) synthesize images conditioned on detailed textual descriptions of the objects in the scene, and Zhang et al. (2017) improved on that by using a two-stage Stacked GAN.

Kim et al. (2017) utilized GAN networks to learn the relation between images in two different domains such that these learned relations can be transferred between the domains. Similar work by Zhu et al. (2017) learned mappings between unpaired images using a cycle-consistency loss. They assume that a mapping from one domain to the other and back should reproduce the original image. Both works exploited large unpaired datasets to learn the relation between domains and formulated the mapping task between images in different domains as a generation problem. Zhu et al.
(2017) compare their generation task with previous works on paired datasets by Isola et al. (2017). They conclude that the result with paired images is the upper bound for their unpaired examples.

Song et al. (2017) propose geometry-guided adversarial networks to synthesize identity-preserving facial expressions. The facial geometry is used as a controlled input to guide the network to synthesize facial images with desired expressions. Similar work by Kossaifi et al. (2017) improves the visual quality of synthesized images by enforcing a mechanism to control the shapes of the objects. They map the generator's output to a mean shape, implicitly enforcing the geometry of the objects, and also add skip connections to transfer priors to the generated objects.
Pathak et al. (2016) generated missing parts of images using networks trained jointly with adversarial and reconstruction losses and produced sharp and coherent images. Yeh et al. (2017) tackle the problem of image inpainting by searching for the encoding of the corrupted image that is closest to another image in the latent space and passing it through the generator to reconstruct the image. The closeness is defined based on a weighted context loss of the corrupted image and a prior loss that penalizes unrealistic images. Yang et al. (2017) propose a multi-scale patch synthesis approach for high-resolution image inpainting by jointly optimizing on image content and texture constraints.
3. Background on GANs
The Generative Adversarial Network (GAN) proposed by Goodfellow et al. (2014) consists of two adversarial networks, a generator and a discriminator, that are trained simultaneously based on the min-max game theory. The generator G is optimized to map a d-dimensional noise vector to an image. The discriminator D, on the other hand, is optimized to accurately distinguish between the synthesized images coming from the generator and the real images coming from the true data distribution. The objective function of such a network is

\min_G \max_D \mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],   (1)

where x is real data sampled from the data distribution p_data and z is a d-dimensional noise vector sampled from a Gaussian distribution p_z.

Conditional GANs synthesize images by looking at some auxiliary variable, which may be labels [Mirza and Osindero (2014)], text embeddings [Zhang et al. (2017); Reed et al. (2016)] or images [Isola et al. (2017); Zhu et al. (2017); Kim et al. (2017)]. In conditional GANs, both the discriminator and the generator networks receive the conditioning variable, represented by c in Eqn. (2). The generator uses this additional information during image synthesis, while the discriminator makes its decision by looking at the pair of conditioning variable and the image it receives. The real pair input to the discriminator consists of a true image from the distribution and its corresponding label, while the fake pair consists of the synthesized image and the label. For conditional GANs, the objective function is

\min_G \max_D \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, c \sim p_{data}(x, c)}[\log D(x, c)] + \mathbb{E}_{x', c \sim p_{data}(x', c)}[\log(1 - D(x', c))],   (2)

where x' = G(z, c) is the generated image.

In addition to the GAN loss, previous works [e.g., Isola et al. (2017); Zhu et al. (2017); Pathak et al. (2016)] have also minimized an L1 reconstruction loss between the generated and real images:

\min_G \mathcal{L}_{L1}(G) = \mathbb{E}_{x, x' \sim p_{data}(x, x')}[\|x - x'\|_1].   (3)

The objective function for such a conditional GAN network is the sum of Eqns. (2) and (3).

Considering the synthesis of ground-level imagery (I_g) from an aerial image (I_a), the conditional GAN loss and L1 loss are given in Eqns. (4) and (5); for synthesis in the opposite direction, I_a and I_g are reversed:

\min_G \max_D \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{I_g, I_a \sim p_{data}(I_g, I_a)}[\log D(I_g, I_a)] + \mathbb{E}_{I_a, I'_g \sim p_{data}(I_a, I'_g)}[\log(1 - D(I'_g, I_a))],   (4)

\min_G \mathcal{L}_{L1}(G) = \mathbb{E}_{I_g, I'_g \sim p_{data}(I_g, I'_g)}[\|I_g - I'_g\|_1],   (5)

where I'_g = G(I_a). The objective function for an image-to-image translation network is the weighted sum of the conditional GAN loss in Eqn. (4) and the L1 loss in Eqn. (5):

\mathcal{L}_{network} = \lambda_1 \mathcal{L}_{cGAN}(G, D) + \lambda_2 \mathcal{L}_{L1}(G),   (6)

where λ_1 and λ_2 are the balancing factors between the losses.
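To make the combined objective of Eqns. (4)-(6) concrete, the sketch below shows one way the generator and discriminator losses could be computed for a single aerial/ground pair. It is a minimal illustration, not the paper's implementation: the generator and discriminator are assumed to exist with the shown interfaces, and the λ values are placeholders.

```python
import torch
import torch.nn as nn

# Assumed interfaces (not the paper's exact networks):
#   G: aerial image I_a -> synthesized ground image I_g'
#   D: (image, conditioning aerial image) -> realism logits
bce = nn.BCEWithLogitsLoss()          # the log terms of Eqn. (4)
l1 = nn.L1Loss()                      # Eqn. (5)
lambda_gan, lambda_l1 = 1.0, 100.0    # balancing factors of Eqn. (6) (assumed values)

def cgan_losses(G, D, I_a, I_g):
    """Losses for one training step when synthesizing ground view I_g from aerial I_a."""
    I_g_fake = G(I_a)

    # Discriminator: real pair (I_g, I_a) -> 1, fake pair (I_g', I_a) -> 0
    d_real = D(I_g, I_a)
    d_fake = D(I_g_fake.detach(), I_a)
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))

    # Generator: fool D and stay close to the ground truth in L1 (Eqn. 6)
    d_fake_for_G = D(I_g_fake, I_a)
    loss_G = lambda_gan * bce(d_fake_for_G, torch.ones_like(d_fake_for_G)) + \
             lambda_l1 * l1(I_g_fake, I_g)
    return loss_D, loss_G
```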
4. Framework
In this section, we discuss the baseline methods and the proposed architectures for the task of cross-view image synthesis.
The naive way to approach this task is to treat it as an image-to-image translation problem. We run experiments in the following settings as our baselines.
For this, the generator is an encoder-decoder network that takes an image in the first view as input and learns to generate the image in the other view as output.
The network takes an image in the first view as input and generates a single 6-channel output: the first 3 channels correspond to the RGB image and the next three channels represent the segmentation map. The L1 loss and adversarial loss are computed over the six output channels and the corresponding ground-truth images.
(a) X-Fork architecture.
(b) X-Seq architecture.
Fig. 1:
Our proposed network architectures. a) X-Fork: similar to the baseline architecture except that G forks to synthesize both the image and the segmentation map in the target view, and b) X-Seq: a sequence of two cGANs, where G1 synthesizes the target-view image that is used by G2 for segmentation map synthesis in the corresponding view. In both architectures, I_a and I_g are real images in aerial and ground views, respectively. S_g is the ground-truth segmentation map in street view. I'_g and S'_g are the synthesized image and segmentation map in the ground view.

Our first architecture, known as the Crossview Fork, is shown in Figure 1a. The generator network is forked to synthesize two outputs of 3 channels each: the first output is the RGB image and the second output is the segmentation map, both in the target view. The fork-generator architecture is shown in Figure 2. The first six blocks of the decoder share weights, because the image and segmentation map contain many shared features. The number of kernels used in each layer (block) of the generator is shown below the blocks.

The idea behind this architecture is multi-task learning by the generator network. Enforcing the generator to learn the semantic class of the pixels together with the image synthesis helps improve the image synthesis task; the generated segmentation map serves as an auxiliary output. The objective function for this network is shown in Eqn. 7:

\mathcal{L}_{X\text{-}Fork} = \lambda_1 \mathcal{L}_{cGAN}(G, D) + \lambda_2 \mathcal{L}_{L1}(I_g, I'_g) + \lambda_2 \mathcal{L}_{L1}(S_g, S'_g).   (7)

Note that the L1 loss is minimized for images as well as segmentation maps, whereas the adversarial loss is optimized for images only. This is because we only care about pixel accuracy for the segmentation maps.

Our second architecture uses a sequence of two cGAN networks, as shown in Figure 1b. The first network performs cross-view image synthesis, and the generated image is fed to the second network as a conditioning input to synthesize its corresponding segmentation map. This two-stage, end-to-end learning of image and segmentation map synthesis produces improved image quality compared to the network without the second stage.
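To make the fork in the generator more concrete, the following is a minimal sketch of a forked decoder head: the shared penultimate block feeds two branches, one for the RGB image and one for the 3-channel segmentation map. The layer sizes, module names and the choice of transposed convolutions are illustrative assumptions, not the exact X-Fork layers.

```python
import torch.nn as nn

class ForkedDecoderHead(nn.Module):
    """Shared penultimate features branch into an RGB image and a 3-channel
    segmentation map, mirroring the fork of Figure 2 (sizes are illustrative)."""
    def __init__(self, in_channels=128):
        super().__init__()
        def up_block(c_in, c_out):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True))
        self.shared = up_block(in_channels, 64)              # last shared decoder block
        self.image_head = nn.Sequential(                     # branch 1: target-view image
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())
        self.seg_head = nn.Sequential(                       # branch 2: segmentation map
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())

    def forward(self, features):
        h = self.shared(features)
        return self.image_head(h), self.seg_head(h)
```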
(Encoder-decoder generator: an aerial image is encoded and decoded into a street-view image and a street-view segmentation map. Block types: Convolution + BN + lReLU (CBL), Convolution + lReLU (CL), Upconvolution + BN + Dropout + ReLU (UBDR), and Upconvolution + BN + ReLU (UBR).)
Fig. 2:
Generator of the X-Fork architecture in Figure 1a. BN means batch-normalization layer. The first six blocks of the decoder share weights, forking at the penultimate block. The number of channels in each convolution layer is shown below each block.

The joint objective function for the X-Seq architecture is shown in Eqn. 8:

\mathcal{L}_{X\text{-}Seq} = \mathcal{L}_{network}(G_1, D_1) + \mathcal{L}_{network}(G_2, D_2),   (8)

where each term is the objective of Eqn. (6) for the corresponding generator-discriminator pair.

Another approach that we take here is to feed the homography-transformed image as input to the translation network. Our hypothesis is that the majority of the scene from the first view is transformed into the second perspective by the homography, and this should ease the synthesis task. The large missing regions in the transformed images mostly correspond to sky and buildings.
The network takes the homography-transformed image as input and generates the target-view image and the segmentation map stacked together as a 6-channel output: the first 3 channels for the image and the next three channels for its segmentation map.
Here, we use the homography-transformed image I_ah as input to the network architecture proposed in subsection 4.2.1. The hypothesis behind this idea is that using transformed images as input should ease the cross-view synthesis task compared to synthesizing from the aerial images I_a directly.

In this setup, we feed the homography-transformed image I_ah rather than the original input image I_a as input to the X-Seq network of subsection 4.2.2.

In this method, we attempt to preserve the structural details visible in the aerial-view images and guide the network to transfer those details to the synthesized ground-view images. For this, we use the homography-transformed image as input to our method and solve the synthesis task in the following subtasks:
Subtask I:
The homography-transformed image (I_ah in Figure 3) has a large missing region (R_1). Our first task is to fill in this missing region in the transformed image. We use an encoder-decoder network that takes I_ah as input and generates only the upper half of the image (I'_g).
Fig. 3: An aerial image I_a, shown on the left, is first transformed to the street-view perspective using homography (I_ah). The transformed image needs inpainting in the upper box region (R_1) and further processing in the car region (R_2). These two regions are generated, and the region in between is copied from the homography-transformed image I_ah to obtain I'_g. We further train on I'_g to smooth the region boundaries and add realism. I_g is the corresponding ground-truth image.

Subtask II:
The street-view images are recorded by a camera mounted on a car's dashboard. Therefore, all the street-view images contain a part of the car's hood around the lower central region (R_2), as can be seen in image I_g in Figure 3. Note that this scenario is present in the SVA dataset and can be avoided in real applications where the dash camera is mounted such that the hood of the car is not captured in the frames. The homography-transformed image I_ah also has a car around that region, transformed from the aerial view of the car, but it does not realistically represent a car in street view. To address this, we mask a probable car region and train a small network dedicated to learning the mapping of the car region. This helps generate a realistic car region in the ground-view images (see region R_2 in image I'_g).

Once we have the images generated from the first two tasks, we copy them to their respective spatial locations in the homography-transformed image I_ah, creating a street-view image I'_g and preserving the pixels of the remaining regions. Copying pixels in this manner helps us preserve the structural information that has been transformed using homography. The problem with this approach is that the images do not look realistic at the region boundaries. So, we further train another network to add realism to this image.

Subtask III:
GANs are successful at generating images that look very realistic to human eyes. Here, we train a conditional GAN architecture on I'_g. For this, we first define bands around the region boundaries, as shown in Figure 3. We formulate the loss function to preserve (by copying) the pixel information outside the bands in the output image while at the same time adding realism to the whole image. This step greatly improves the visual quality of the synthesized image.

We now define our loss functions for the subtasks. For subtask one, we use a conditional GAN network to inpaint the missing regions in I_ah by optimizing the network with adversarial and L1 losses over the missing regions only. For subtask two, we only consider region R_2 by masking out the remaining regions in the input and output images and optimizing the adversarial and L1 losses for the car region only. Once we have results from the above two subtasks, we compute I'_g as shown in Eqn. 9:

I'_g = I_{Inpaint} \odot M_1 + I_{car} \odot M_2 + I_{ah} \odot (M - M_1 - M_2),   (9)

where I_Inpaint is the image generated by the inpainting network of subtask one, I_car is the car image generated in subtask two, M_1 and M_2 are 3-channel binary masks for regions R_1 and R_2, M is the 3-channel all-ones image, and \odot is the element-wise product. The masks M_1 and M_2 are computed manually by looking at the homography-transformed image (I_ah) in Figure 3. This was done for a single frame only and worked well for all the images in the dataset. If the hood of the car were not visible in the street-view image, we would not even need region R_2 and, correspondingly, mask M_2. I'_g is fed to the realism network to generate the final image in the target view.
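A minimal sketch of the composition in Eqn. 9, assuming the inpainted region, the generated car region, and the binary masks are already available as arrays; the function and argument names are illustrative.

```python
import numpy as np

def compose_street_view(I_inpaint, I_car, I_ah, M1, M2):
    """Blend the three sources into I_g' as in Eqn. 9.
    All inputs are H x W x 3 float arrays; M1 and M2 are 3-channel binary masks
    for regions R1 (missing upper region) and R2 (car hood region)."""
    M = np.ones_like(M1)                                   # 3-channel all-ones mask
    return I_inpaint * M1 + I_car * M2 + I_ah * (M - M1 - M2)
```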
5. Experimental Setting
For the experiments in this work, we use three datasets, described in the following.
Dayton.
This cross-view image dataset is provided by Vo and Hays (2016). It consists of more than 1M pairs of street-view and overhead-view images collected from 11 different US cities. We select 76,048 image pairs from the Dayton images and create a train/test split with 55,000 pairs for training and the rest for testing. We use this dataset for experiments in both aerial-to-ground (a2g) and ground-to-aerial (g2a) directions.

CVUSA.
We use this dataset [Workman et al. (2015)] for direct comparison of our work with Zhai et al. (2017). It consists of a train/test split of 35,532 and 8,884 image pairs. Following Zhai et al. (2017), the aerial images are center-cropped to 224 × 224 and then resized to 256 × 256 in our experiments.
SVA.
The Surround Vehicle Awareness (SVA) dataset [Palazzi et al. (2017)] is a synthetic dataset collected from the Grand Theft Auto V (GTAV) video game. The game camera is toggled between frontal and bird's-eye view to simultaneously capture images in the two views at each game time step. We use the train/test split as provided in the dataset. The original dataset has 100 sets of training images and 50 sets of test images. Consecutive frames in each set are very similar to each other, so we use every tenth frame to remove redundancy in the dataset. Finally, we have a training set of 46,030 image pairs and a test set of 22,254 image pairs. The images are resized to 256 × 256 for the experiments in this work. Sample images from the SVA dataset are shown in the leftmost and rightmost columns of Figure 7. We use this dataset for experiments in the aerial-to-ground (a2g) direction only. Note that the homography-related experiments are performed on this dataset only.

The proposed Fork and Seq networks learn to generate the target-view images and segmentation maps conditioned on the source-view image or its homography-transformed version. The training procedure requires the images as well as their semantic segmentation maps. The CVUSA dataset has annotated segmentation maps for ground-view images, but for the SVA and Dayton datasets such information is not available. To compensate, we use one of the leading semantic segmentation methods, RefineNet [Lin et al. (2017)].

Fig. 4: RGB image pairs from the train set (left), segmentation masks from the pre-trained RefineNet [Lin et al. (2017)] overlaid on them (middle), and segmentation masks generated by the X-Fork method overlaid on them (right).

This network is pre-trained on outdoor scenes of the Cityscapes dataset [Cordts et al. (2016)] and is used to generate the segmentation maps that are utilized as ground-truth maps. These semantic maps have pixel labels from 20 classes (e.g., road, sidewalk, building, vegetation, sky, void, etc.). Figure 4 shows image pairs from the Dayton dataset and their segmentation masks overlaid in both views. As can be seen, the segmentation mask (label) generation process is not perfect, since it is unable to segment parts of buildings, roads, cars, etcetera in the images.
We use homography as a preprocessing step to transform the visual features from aerial images to the ground perspective. Since the locations of the aerial and ground-view cameras are fixed in the SVA dataset, we first randomly pick a pair of images from the dataset. We then manually select four points in the aerial image, find their corresponding locations in the ground-view image, and use them to compute the homography matrix that transforms the aerial image to the ground view and vice versa. Surprisingly, this method works well and avoids the expensive homography estimation usually done for each pair of images separately by computing SIFT features and matching keypoints between the images. We also tried computing SIFT features, finding keypoints in the two images, and transforming images based on those points, but this method could not find corresponding points in the two views, most likely because of the very large perspective variation between the views.

We use a conditional GAN architecture very similar to Isola et al. (2017) as the base architecture. The generator is an encoder-decoder network with blocks of Convolution, Batch Normalization [Ioffe and Szegedy (2015)] and activation layers. Leaky ReLU with a slope of 0.2 is used as the activation function in the encoder, whereas the decoder has ReLU activations except for its final layer, where Tanh is used. The first three blocks of the decoder have a Dropout layer between the Batch Normalization and activation layers, with a dropout rate of 50%. The discriminator is taken as is from Isola et al. (2017). For the experiments on the SVA dataset, we removed two blocks of CBL and UBDR from the generator architecture, primarily to save training time; we observed that removing these blocks did not have much impact on the quality of the synthesized images. Also, note that the synthesized semantic maps are 3-channel RGB images, which effectively mitigated the class imbalance among the semantic classes. This was primarily done to consider all 20 semantic classes during training and reduce bias towards dominant classes like houses, trees, road and sky. Had the semantic classes been limited to the dominant ones only, it would have regularized the synthesized images not to learn the less prevalent objects in the target-view images. We were also encouraged by the success of pix2pix in synthesizing 3-channel segmentation maps from RGB images.

The convolutional kernels used in the networks are 4 × 4. The upconvolution operation is SpatialFullConvolution and upsamples the input by 2; the convolutional operation downsamples the images by 2. No pooling operation is used in the networks. The λ_1 and λ_2 used in the objective functions of the different networks are the balancing factors between the GAN loss and the L1 loss; we fix λ_1 at 1 and λ_2 at 100 for the Fork and Seq models. For the realism task in the H-Regions method, we set λ_1 = λ_2. Motivated by its effectiveness as shown by Salimans et al. (2016), we use one-sided label smoothing to stabilize the training process, replacing 1 with 0.9 for real labels. During training, we use data augmentation methods such as random jitter and horizontal flipping of images. The network is trained end-to-end with weights initialized from a random Gaussian distribution with zero mean and 0.02 standard deviation. Our methods are implemented in Torch [Collobert et al. (2011)].
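As a concrete illustration of the homography preprocessing described above, the sketch below estimates the transform from four manually chosen correspondences and warps an aerial image into the street-view perspective. The point coordinates and file names are placeholders, not the values used for the SVA dataset.

```python
import cv2
import numpy as np

# Four manually selected correspondences: (x, y) in the aerial image and the
# matching (x, y) in the street-view image. These coordinates are placeholders.
pts_aerial = np.float32([[60, 40], [200, 40], [230, 220], [30, 220]])
pts_street = np.float32([[0, 128], [256, 128], [256, 256], [0, 256]])

# 3x3 homography mapping aerial pixels into the street-view perspective.
H = cv2.getPerspectiveTransform(pts_aerial, pts_street)

aerial = cv2.imread("aerial.png")                      # placeholder path
warped = cv2.warpPerspective(aerial, H, (256, 256))    # I_ah: input to the networks
cv2.imwrite("aerial_warped.png", warped)
```

Because the aerial and ground cameras are fixed relative to each other in SVA, the same matrix H can be reused for every image pair, which is what makes this one-time manual estimation sufficient.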
Our code and data are available online at: https://github.com/kregmi/cross-view-image-synthesis.git. We train the networks for 100 epochs on low-resolution and 35 epochs on high-resolution images of the Dayton dataset, and for 20 epochs on the SVA dataset. Experiments are conducted on the CVUSA dataset for comparison with the work of Zhai et al. (2017); following their setup, we train our architectures for 30 epochs using the Adam optimizer.
6. Evaluation and Model Comparison
We have conducted experiments in the a2g (aerial-to-ground) and g2a (ground-to-aerial) directions on the Dayton dataset and in the a2g direction only on the CVUSA and SVA datasets. We consider image resolutions of 64 × 64 and 256 × 256 on the Dayton dataset, while for the experiments on the CVUSA and SVA datasets 256 × 256 images are used.

It is not straightforward to evaluate the quality of synthesized images [Borji (2018)]. In fact, evaluation of GAN methods continues to be an open problem [Theis et al. (2016)]. Here we utilize four quantitative measures and one qualitative measure to evaluate our methods.
For the subjective evaluation of the different methods, we ran a user study on the images synthesized using these methods. We show an aerial image along with the corresponding images in the ground view synthesized using seven different methods to 10 users. We specifically ask each user to select the most realistic image that also contains the most visual details from the aerial-view image.

Inception Score:
A common quantitative measure used in GAN evaluation is the Inception Score proposed by Salimans et al. (2016). The core idea behind the Inception Score is to assess how diverse the generated samples are within a class while being meaningfully representative of the class at the same time. One major criticism of the Inception Score is that a CNN trained on ImageNet objects may not be adequate for other scene datasets (our case). To address this, we use the AlexNet model [Krizhevsky et al. (2017)] trained on the Places dataset [Zhou et al. (2017)] with 365 categories to compute the Inception Score. The Places dataset has images similar to those in our datasets. We observe that the confidence scores predicted by the pre-trained model on our datasets are dispersed between classes for many samples, and not all categories are represented by the images. Therefore, we compute Inception Scores on Top-1 and Top-5 classes, where "Top-k" means that the top k predictions for each image are unchanged while the remaining predictions are smoothed by an epsilon equal to (1 − Σ(top-k predictions)) / (n − k classes).
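The Top-k smoothing described above can be written out as follows. This is a sketch under the assumption that the Places classifier is abstracted behind an array of per-image softmax outputs; the exact smoothing and averaging used in the paper may differ in detail.

```python
import numpy as np

def smooth_topk(p, k):
    """Keep the top-k probabilities of p and spread the remaining mass uniformly
    over the other classes (epsilon = (1 - sum(top-k)) / (n - k))."""
    n = p.shape[0]
    topk_idx = np.argsort(p)[-k:]
    eps = (1.0 - p[topk_idx].sum()) / (n - k)
    q = np.full(n, eps)
    q[topk_idx] = p[topk_idx]
    return q

def inception_score(probs, k=None):
    """probs: (num_images, num_classes) softmax outputs of the scene classifier.
    Returns exp(E_x[KL(p(y|x) || p(y))]), optionally after Top-k smoothing."""
    if k is not None:
        probs = np.stack([smooth_topk(p, k) for p in probs])
    p_y = probs.mean(axis=0, keepdims=True)               # marginal class distribution
    kl = (probs * (np.log(probs + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))
```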
Accuracy: In addition to the Inception Score, we compute the top-k prediction accuracy between real and generated images. We use the same pre-trained AlexNet model to obtain annotations for real images and class predictions for generated images. We compute top-1 and top-5 accuracies. For each setting, accuracies are computed in two ways: 1) considering all images, and 2) considering only real images whose top-1 (highest) prediction is greater than 0.5. Below each accuracy heading in the tables, the first column considers all images, whereas the second column computes accuracies in the second way.
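One plausible reading of the two accuracy settings is sketched below, again with the classifier abstracted away as arrays of softmax outputs; the threshold handling and helper names are assumptions for illustration.

```python
import numpy as np

def topk_accuracy(real_probs, fake_probs, k=1, confident_only=False):
    """Treat the classifier's top-1 label on each real image as its annotation and
    count a generated image as correct if that label is among its top-k predictions.
    If confident_only is set, keep only real images whose top prediction exceeds 0.5
    (the second reported setting)."""
    labels = real_probs.argmax(axis=1)
    if confident_only:
        keep = real_probs.max(axis=1) > 0.5
        labels, fake_probs = labels[keep], fake_probs[keep]
    topk = np.argsort(fake_probs, axis=1)[:, -k:]
    hits = np.any(topk == labels[:, None], axis=1)
    return 100.0 * hits.mean()
```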
KL(model ∥ data): We compute the KL divergence between the model-generated images and the real data distribution for quantitative analysis of our work, similar to some generative works [Che et al. (2016); Nguyen et al. (2017)]. We again use the same pre-trained AlexNet described earlier. A lower KL score implies that the generated samples are closer to the real data distribution.
SSIM, PSNR and Sharpness Difference: We employ three measures from the image quality assessment literature to evaluate our models, similar to Mathieu et al. (2016); Ledig et al. (2017); Shi et al. (2016); Park et al. (2017). Structural Similarity (SSIM) measures the similarity between images based on their luminance, contrast and structural aspects; SSIM values range between −1 and +1. Peak Signal-to-Noise Ratio (PSNR) measures the peak signal-to-noise ratio between two images to assess the quality of a transformed (generated) image compared to its original version. Sharpness Difference (SD) measures the loss of sharpness during image generation. For each of these scores, higher is better. Please refer to Regmi and Borji (2018) for details on how to compute these scores.
FID Score: An alternative metric for evaluating the quality of generated images is the Fréchet Inception Distance [Heusel et al. (2017)] between the generated samples and the real images. We use the same AlexNet model (as above), pre-trained on the Places dataset, to compute the FID score; the lower the FID score of a method, the better. The FID scores that we obtain in this work are relatively larger than the numbers reported in other works, mainly because of the difference in image statistics between the Places dataset used during training and our test images.
We conduct qualitative and quantitative evaluations on the three datasets. We report homography results on the SVA dataset only, because the aerial-view image in the SVA dataset has a high overlap with the field of view of the street-view image, so applying homography to preserve the details from the aerial image is more valid for this dataset than for the other two. We conduct user studies to compare the images synthesized using different methods, illustrate qualitative results in Figures 5, 6 and 7, and conduct an in-depth quantitative evaluation on the test images of the datasets.

Dayton dataset.
For the 64 × 64 resolution experiments, the networks are modified by removing the last two blocks of CBL from the discriminator and encoder, and the first two blocks of UBDR from the decoder of the generator. We run experiments on all three methods. Qualitative results are depicted in Figure 5 (left). The results affirm that the networks have learned to transfer the image representations across the views. Generated ground-level images clearly show details of roads, trees, sky, clouds and pedestrian lanes. Trees, grass, roads and house roofs are well rendered in the synthesized aerial images.

For 256 × 256 resolution synthesis, we conduct experiments on all three architectures and illustrate the qualitative results in Figure 5 (right). We observe that the images generated at high resolution contain more details of objects in both views and are less granulated than those at low resolution. Houses, trees, pedestrian lanes and roads look more natural.
CVUSA dataset.
Test results on the CVUSA dataset show that the images generated by the proposed methods are visually better than those of Zhai et al. (2017) and Isola et al. (2017). Our proposed methods are more successful at transforming the pixel classes, and thus at generating objects of the correct class in the target view, compared to the baselines.
SVA dataset.
The qualitative results on the SVA dataset for aerial to street-view synthesis are shown in Figure 7. The proposed methods are capable of generating roads, the car hood, markers on the road, sky and other details in the images. We observe that the images generated by the proposed H-Regions method contain more details around the central regions. This is due to enforcing the network to preserve those details from the aerial images.
User Study.
We conducted the perceptual test over 100 test images with 10 subjects to compare the images synthesized by different methods on the SVA dataset.
Fig. 5:
Example images generated by different methods in low (64 × 64) resolution (left) and high (256 × 256) resolution (right) in the a2g and g2a directions on the Dayton dataset.
Fig. 6:
Qualitative results of our methods and baselines on the CVUSA dataset in the a2g direction. The first two columns show true image pairs; the next five columns show images generated by Zhai et al. (2017), X-Pix2pix [Isola et al. (2017)], X-SO, X-Fork and X-Seq, respectively.
Table 1: % of user preferences over images synthesized by different methods (over 100 images from the SVA dataset).

X-Pix2pix  X-Fork  X-Seq  H-Pix2pix  H-Fork  H-Seq  H-Regions
4          7.8     16.8   12.6       14.2    18.2   26.4

The results are presented in Table 1. The most preferred method is H-Regions, closely contested by the H-Seq and X-Seq methods. The results illustrate the following: a) the use of homography-transformed input drastically outperforms the corresponding experiments with the untransformed aerial image as input, and b) users preferred the images synthesized using H-Regions because of the method's ability to preserve the pixel information in the target view.

We report the quantitative results on the three datasets next.
Dayton dataset.
The quantitative scores for the different measures are provided in Tables 2, 3, 4, 5 and 6 under the columns Dayton (64 × 64) and Dayton (256 × 256) for low- and high-resolution synthesis.

Inception Score: The scores for X-Fork-generated images are closest to those of the real data distribution for the Dayton dataset at low resolution in both directions, and also at high resolution in the a2g direction. The X-Seq method works best for g2a synthesis at high resolution on the Dayton dataset. Inception Scores on Top-k classes follow a similar pattern as over all classes (except for the Top-1 class at low resolution and the Top-5 class at high resolution in the g2a direction on the Dayton dataset).

Accuracy: Results are shown in Table 3. We observe that the X-Fork method works better at low resolution, whereas X-Seq is better at high-resolution synthesis in both directions.

KL(model ∥ data): The scores are provided in Table 4. As can be seen, our proposed methods generate much better results than the baselines. X-Fork generates images very similar to the real distribution in all experiments except the high-resolution a2g experiment, where X-Seq is slightly better than X-Fork.

FID Score: The FID scores are presented in Table 5. X-Fork performs best on lower-resolution images, while X-Seq works best on higher-resolution images in both the a2g and g2a directions.

SSIM, PSNR, and SD: The scores are reported in Table 6. The X-Seq model works best in the a2g direction, while X-Fork outperforms the rest in the g2a direction.

CVUSA dataset.
The quantitative evaluation is shown in Tables 2, 3, 4, 5 and 6 under the column CVUSA.

Inception Score: The X-Seq method works best for the CVUSA dataset in terms of Inception Score.

Accuracy: Results are shown in Table 3. Images generated with the X-Fork method obtain the best accuracy, closely followed by the X-Seq method.

KL(model ∥ data): The scores are provided in Table 4. As can be seen, our proposed methods generate much better results than the baseline. X-Fork generates images very similar to the real distribution, with X-Seq very close to it.

FID Score: X-Seq works best on the CVUSA dataset, closely followed by the X-Fork network. See Table 5.

SSIM, PSNR, and SD: Images generated using X-Fork are better than other methods in terms of SSIM, PSNR and SD. We find that X-Fork improves over Zhai et al. by 5.03% in SSIM, 8.93% in PSNR, and 12.35% in SD.
SVA dataset.
The quantitative evaluation on the SVA dataset is presented in Table 7.

Inception Score: H-Regions generates images whose Inception Score is closest to that of the real data.

Accuracy: The H-Seq method performs best in terms of accuracy.

KL(model ∥ data): Images synthesized using the X-Seq method have the closest distribution to the ground-truth distribution among all the methods.

FID Score: H-Regions performs best in terms of FID score on the SVA dataset. The H- models perform better than their X- counterparts.
Fig. 7: Example images generated by different methods in the a2g direction for the SVA dataset. Columns, left to right: Aerial, Homography, X-Pix2pix, X-SO, X-Fork, X-Seq, H-Pix2pix, H-SO, H-Fork, H-Seq, H-Regions, GT.
Table 2: Inception Scores of models on the Dayton and CVUSA datasets. Scores are reported over all classes, the Top-1 class and the Top-5 classes for Dayton (64 × 64), Dayton (256 × 256) and CVUSA, in both a2g and g2a directions, for the real data and for each method.
SSIM, PSNR, and SD: X-Seq achieves the highest numbers in terms of SSIM and PSNR, whereas H-Regions has the best SD. Also, H-Pix2pix already performs very well compared to X-Pix2pix, because the homography simplifies the learning task by transforming the image to the target view. The H- methods outperform their X- counterparts on most of the evaluation metrics.

Because there is no consensus on the evaluation of GANs, we had to use several scores. Theis et al. (2016) show that these scores often do not agree with each other, and we observed this in our evaluations as well. Nonetheless, we find that the proposed methods are consistently superior to the baselines in terms of both quantitative and qualitative evaluations.
7. Discussion and Conclusion
We explored image generation using conditional GANs between two drastically different views. Generating semantic segmentations together with images in the target view helps the networks learn better images compared to the baselines. Using homography to guide the cross-view synthesis allows preserving the overlapping regions between the views. Extensive qualitative and quantitative evaluations testify to the effectiveness of our methods. Future research can explore other cues, such as edge maps, to facilitate the synthesis task. Also, effort can be put towards automatically finding the regions that we defined manually. The challenging nature of the problem leaves room for further improvements. Potential applications of this work include bridging the large domain gap between street-view and aerial imagery for cross-view image geo-localization, multi-view object synthesis, and so on.
Table 3: Top-1 and Top-5 prediction accuracies (%) on the Dayton and CVUSA datasets. For each dataset and resolution, accuracy is reported over all images and over images whose top-1 prediction exceeds 0.5, in both a2g and g2a directions.

Table 4: KL divergence between model and data distributions on the Dayton (64 × 64), Dayton (256 × 256) and CVUSA datasets (reported as mean ± standard deviation).

Table 5: Fréchet Inception Distance (FID) scores on the Dayton (64 × 64), Dayton (256 × 256) and CVUSA datasets.

Table 6: SSIM, PSNR and Sharpness Difference (SD) between real data and samples generated using different methods on the Dayton (64 × 64), Dayton (256 × 256) and CVUSA datasets.

Table 7: Quantitative evaluation of samples generated using different methods on the SVA dataset in the a2g direction. Methods compared: X-Pix2pix, X-SO, X-Fork, X-Seq, H-Pix2pix, H-SO, H-Fork, H-Seq and H-Regions; metrics: Inception Score (all, Top-1 and Top-5 classes), Top-1 and Top-5 accuracies, KL(model ∥ data), SSIM, PSNR, SD and FID score. The Inception Score for the real (ground-truth) data is 3.0347, 2.3886 and 3.3446 for the all, Top-1 and Top-5 setups, respectively.

References

Ardeshir, S., Borji, A., 2016. Ego2Top: Matching viewers in egocentric and top-view videos, in: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V, pp. 253–268.
Borji, A., 2018. Pros and cons of GAN evaluation measures. arXiv preprint arXiv:1802.03446.
Che, T., Li, Y., Jacob, A.P., Bengio, Y., Li, W., 2016. Mode regularized generative adversarial networks. arXiv e-prints.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S., 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in: Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp. 6626–6637.
Hinton, G.E., Osindero, S., Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on Machine Learning, JMLR.org, pp. 448–456.
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A., 2017. Image-to-image translation with conditional adversarial networks, in: CVPR.
Kim, T., Cha, M., Kim, H., Lee, J.K., Kim, J., 2017. Learning to discover cross-domain relations with generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning, PMLR, pp. 1857–1865.
Kossaifi, J., Tran, L., Panagakis, Y., Pantic, M., 2017. GAGAN: Geometry-aware generative adversarial networks. arXiv preprint arXiv:1712.00684.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90.
Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A.P., Tejani, A., Totz, J., Wang, Z., Shi, W., 2017. Photo-realistic single image super-resolution using a generative adversarial network, in: CVPR 2017, pp. 105–114.
Lin, G., Milan, A., Shen, C., Reid, I., 2017. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation, in: CVPR.
Lin, T., Cui, Y., Belongie, S.J., Hays, J., 2015. Learning deep representations for ground-to-aerial geolocalization, in: CVPR 2015, pp. 5007–5015.
Lin, T.Y., Belongie, S., Hays, J., 2013. Cross-view image geolocalization, in: CVPR.
Mathieu, M., Couprie, C., LeCun, Y., 2016. Deep multi-scale video prediction beyond mean square error, in: ICLR 2016.
MH, S.H.M.F.R., Lee, N.G.H., 2018. CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization, in: CVPR, pp. 7258–7267.
Mirza, M., Osindero, S., 2014. Conditional generative adversarial nets. CoRR abs/1411.1784.
Nguyen, T., Le, T., Vu, H., Phung, D., 2017. Dual discriminator generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2670–2680.
Palazzi, A., Borghi, G., Abati, D., Calderara, S., Cucchiara, R., 2017. Learning to map vehicles into bird's eye view, in: Image Analysis and Processing - ICIAP 2017, Springer International Publishing, Cham, pp. 233–243.
Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C., 2017. Transformation-grounded image generation network for novel 3D view synthesis. CoRR abs/1703.02921.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A., 2016. Context encoders: Feature learning by inpainting, in: CVPR, pp. 2536–2544.
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H., 2016. Generative adversarial text to image synthesis, in: Proceedings of The 33rd International Conference on Machine Learning, PMLR, pp. 1060–1069.
Regmi, K., Borji, A., 2018. Cross-view image synthesis using conditional GANs, in: CVPR.
Regmi, K., Shah, M., 2019. Bridging the domain gap for ground-to-aerial image matching. arXiv preprint arXiv:1904.11045.
Salakhutdinov, R., Hinton, G.E., 2009. Deep Boltzmann machines, in: AISTATS.
Salimans, T., Goodfellow, I.J., Zaremba, W., Cheung, V., Radford, A., Chen, X., 2016. Improved techniques for training GANs, in: Advances in Neural Information Processing Systems 29, pp. 2226–2234.
Shi, W., Caballero, J., Huszar, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z., 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: CVPR 2016, pp. 1874–1883.
Smolensky, P., 1986. Information processing in dynamical systems: Foundations of harmony theory, in: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press, Cambridge, MA, USA, pp. 194–281.
Song, L., Lu, Z., He, R., Sun, Z., Tan, T., 2017. Geometry guided adversarial facial expression synthesis.
Soran, B., Farhadi, A., Shapiro, L., 2014. Action recognition in the presence of one egocentric and multiple static cameras, in: Asian Conference on Computer Vision, Springer, pp. 178–193.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision, in: CVPR 2016, pp. 2818–2826.
Tatarchenko, M., Dosovitskiy, A., Brox, T., 2016. Multi-view 3D models from single images with a convolutional network, in: Computer Vision – ECCV 2016, Springer International Publishing, Cham, pp. 322–337.
Theis, L., van den Oord, A., Bethge, M., 2016. A note on the evaluation of generative models, in: International Conference on Learning Representations.
Vo, N.N., Hays, J., 2016. Localizing and orienting street views using overhead imagery, in: Computer Vision – ECCV 2016, Springer International Publishing, Cham, pp. 494–509.