Six-channel Image Representation for Cross-domain Object Detection
Tianxiao Zhang†, Wenchi Ma†, Guanghui Wang‡
†Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA
‡Department of Computer Science, Ryerson University, Toronto, ON, Canada M5B 2K3
Abstract—Most deep learning models are data-driven, and their excellent performance depends heavily on abundant and diverse datasets. However, it is very hard to obtain and label data for some specific scenes or applications. A detector trained on data from one domain cannot perform well on data from another domain due to domain shift, which is one of the major challenges for most object detection models. To address this issue, image-to-image translation techniques are often employed to generate fake data of specific scenes to train the models. With the advent of Generative Adversarial Networks (GANs), we can realize unsupervised image-to-image translation in both directions: from a source to a target domain and from the target back to the source domain. In this study, we report a new approach to making use of the generated images. We propose to concatenate the original 3-channel images and their corresponding GAN-generated fake images to form 6-channel representations of the dataset, hoping to address the domain shift problem while exploiting the success of available detection models. The idea of augmented data representation may inspire further study on object detection and other applications.
Index Terms—object detection, domain shift, unsupervised image-to-image translation
I. INTRODUCTION
Computer vision has progressed rapidly with deep learning techniques, with more advanced and accurate models for object detection, image classification, image segmentation, pose estimation, and tracking emerging almost every day [31], [42], [49]. Even though computer vision has entered a new era with deep learning, plenty of problems remain unsolved, and domain shift is one of them. Although CNN models are dominating computer vision, their performance often becomes inferior when tested on unseen data or data from a different domain, a phenomenon denoted as domain shift. Since most deep learning models are data-driven and high accuracy is mostly guaranteed by an enormous amount of varied data, domain shift often arises when there is not enough labeled data of the kind that appears in the test set. For instance, even if we only detect cars on the roads, training the models on day scenes cannot guarantee effective detection of cars in night scenes. We might have to utilize enough data from night scenes to train the models; nonetheless, the data from some specific scenes are sometimes rare or unlabeled, which makes it even more difficult to mitigate the domain shift effect.

To mitigate the situation where some kinds of training data are absent or rare, image-to-image translation, which translates images from one domain to another, is highly desirable. Fortunately, with the advent of Generative Adversarial Networks (GANs) [15], some researchers aim to generate fake datasets of specific scenes using GAN models to overcome the lack of such data. With unpaired image-to-image translation GAN models (e.g., CycleGAN [52]), we can not only translate images from the source domain to the target domain, but also translate images from the target domain to the source domain, and the entire process does not require any paired images, which makes it ideal for real-world applications.

The GAN models for image-to-image translation can generate, from the original source-domain images in the training dataset, corresponding fake images of the target domain, and we can utilize the GAN-generated images to train object detection models and test them on images of the target domain [2]. Since we aim to solve cross-domain object detection problems, after pre-processing the data and generating the fake images with image-to-image translation models, the generated data are fed into the object detection models for training, and the trained model demonstrates its effectiveness when tested on data from the target domain. In [2], employing GAN-generated fake images to train the detection models, so that the training data and testing data share the same domain, was shown to be effective: the detection performance was boosted in the scenario where the training data come from one domain while the testing data come from another.

Instead of simply utilizing the fake images to train the model, we propose to solve the problem from a new perspective by concatenating the original images and their corresponding GAN-translated fake images to form new 6-channel representations. For instance, if we only have source-domain images but intend to test our model on unlabeled images in the target domain, we first train the image-to-image translation model with source-domain and target-domain data, and then employ the trained translation model to generate the corresponding fake images.
Since some image-to-image translation models [52] can translate images in both directions, we are able to acquire corresponding fake data for images from both the source domain and the target domain. Thus, both training images and testing images are augmented into 6-channel representations by concatenating the three RGB channels of the original images with those of the corresponding fake images. We can then train and test with available detection models; the only difference is that the input dimension of the convolutional kernels in the first layer becomes 6 instead of 3. The process of training and testing the proposed method is depicted in Fig. 1.

Fig. 1. The flow chart of the proposed 6-channel image augmentation approach for training and testing CNN-based detection models.
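To make this concrete, the following is a minimal sketch, not our actual training code, of the single architectural change, using a torchvision ResNet-101 as a stand-in backbone; how the widened layer is initialized is discussed in Section IV-B.

```python
import torch
import torch.nn as nn
import torchvision

# Stand-in backbone: any CNN detector backbone is modified the same way.
backbone = torchvision.models.resnet101(pretrained=False)

# The only change: the first convolution accepts 6 input channels instead of 3.
backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)

x = torch.randn(2, 6, 256, 256)   # a batch of concatenated real+fake images
feats = backbone.conv1(x)         # every later layer is unchanged
print(feats.shape)                # torch.Size([2, 64, 128, 128])
```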
II. RELATED WORK
Image-to-image translation is a popular topic in computer vision [43], [44]. With the advent of Generative Adversarial Networks [15], it can be mainly categorized into supervised and unsupervised image-to-image translation [1]. The supervised image-to-image translation models, such as pix2pix [20] and BicycleGAN [54], require image pairs from two or more domains (i.e., the exact same image scenes from day and night), which are extremely expensive and unrealistic to acquire in the real world. Although the quality of the translated images sometimes exceeds expectations, these models are not ideal for real-world applications.

The unsupervised image-to-image translation models can be divided into cycle-consistency-based models (e.g., CycleGAN [52], DiscoGAN [22], DualGAN [46]), which introduce cycle consistency losses; autoencoder-based models (e.g., UNIT [29]), which are combined with autoencoders [23]; and recent disentangled representation models (e.g., MUNIT [18], DRIT [25]). Since the unsupervised image-to-image translation models only require image sets from two or more domains and do not necessitate any paired images, which are arduous to collect and annotate, they are often leveraged to generate fake data in the target domain for other computer vision tasks such as object detection and image classification. Among these unsupervised image-to-image translation models, CycleGAN [52] is frequently utilized as the image-mapping model to generate fake data for cross-domain problems [19], [2].

Object detection addresses the problem of detecting semantic instances in digital images or videos. The fundamental purpose of object detection is to classify the objects shown in the images or videos and simultaneously locate those objects by coordinates [32]. Object detection is applied in various fields such as medical image analysis [33], self-driving cars, pose estimation, segmentation, etc.

From the perspective of stages, object detectors are categorized into two types: one-stage detectors and two-stage detectors. Two-stage object detectors, such as Faster R-CNN [37], MS-CNN [4], R-FCN [8], and FPN [26], are often comprised of a region proposal network as the first stage, which selects the candidate anchors that have high probabilities of containing objects, and a detection network as the second stage, which classifies the objects contained in these candidates, further performs bounding box regression to refine their coordinates, and finally outputs the results. One-stage object detectors, such as SSD [30], YOLOv1-v4 [34], [35], [36], [3], and RetinaNet [27], often directly classify and regress the pre-defined anchor boxes instead of choosing candidates. Thus, two-stage models often outperform their one-stage counterparts, while one-stage models frequently have a faster inference rate than two-stage approaches.

Due to the various sizes and shapes of objects, some models [30], [26], [27], [48] design anchor boxes on different levels of the feature maps (pixels on lower-level feature maps have a small receptive field and pixels on higher-level feature maps have a large receptive field) so that the anchors on lower-level features are responsible for relatively small objects and the anchors on higher-level features are in charge of detecting relatively large objects.
Middle-sized objects are then recognized by the middle-level feature maps.

The aforementioned detection models are anchor-based, so pre-defined anchor boxes have to be designed for them. In recent years, some anchor-free models [51], [50], [10], [39], [24] have attracted great attention for their excellent performance without any pre-defined anchor boxes. Some of them even dominate the accuracy on the COCO benchmark [28]. Since a large number of anchors has to be generated for anchor-based models, and most of them are useless because the majority contain no object, anchor-free models might predominate in the design of object detectors in the future. Recently, the transformer [40] has been applied successfully to object detection [5] as an anchor-free model with attention mechanisms.

Nonetheless, many problems have not been well solved in this field, especially in cross-domain object detection. Modern object detectors are based on deep learning techniques, and deep learning is data-driven, so the performance of modern object detectors is highly dependent on how much annotated data can be employed as the training set. Cross-domain issues arise when there are not enough labeled training data from the same domain as the testing data, or when the dataset is diverse or composed of various datasets of different domains in both the training and testing data.

Domain Adaptive Faster R-CNN [6] explores the cross-domain object detection problem based on Faster R-CNN. By utilizing a Gradient Reversal Layer (GRL) [12] in an adversarial training manner similar to Generative Adversarial Networks (GANs) [15], the paper proposes an image-level adaptation component and an instance-level adaptation component that augment the Faster R-CNN structure to realize domain adaptation. In addition, a consistency regularizer between those two components alleviates the effects of the domain shift between different datasets such as KITTI [13], Cityscapes [7], Foggy Cityscapes [38], and SIM10K [21].

Universal object detection by domain attention [41] addresses the universal object detection of various datasets by an attention mechanism [40]. Universal object detection is arduous to realize since object detection datasets are diverse and there exists a domain shift between them. The paper proposes a domain adaptation module comprised of a universal SE [17] adapter bank and a new domain-attention mechanism to realize universal object detection. The work in [19] deals with cross-domain object detection where instance-level annotations are accessible in the source domain while only image-level labels are available in the target domain. The authors exploit an unpaired image-to-image translation model (CycleGAN [52]) to generate fake data in the target domain to fine-tune the model trained on the source-domain data. Finally, the model is fine-tuned again on the detected results of the testing data (pseudo-labeling) to make the model even better.

The study [2] utilizes CycleGAN [52] as the image-to-image translation model to translate images in both directions. When tested on data from the target domain, the model trained on the fake data in the target domain performs better than the one trained on the original data in the source domain. The dataset we employ in this paper is from [2], and we follow exactly the same pre-processing procedure to prepare the dataset.
In the following, we discuss our proposal, which utilizes concatenated image pairs (real images and their corresponding fake images) to train the detection model, and compare it with the corresponding approach from [2].

III. PROPOSED APPROACH
The framework of our proposed method is depicted in Fig. 1. In our implementation, we employ CycleGAN for image-to-image translation, trained with data from the source domain (e.g., day images) and data from the target domain (e.g., night images). First, the fake data (target domain) are generated from the original data (source domain) via the trained image-to-image translation model (e.g., generating fake night images from real day images). Then, the real and fake images are normalized and concatenated (i.e., two 3-channel images are concatenated to form a 6-channel representation of the image). Finally, the concatenated images are used to train the CNN models. At test time, the test data are processed in the same way as the training data to form concatenated images and sent to the trained CNN model for detection.
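The normalization-and-concatenation step amounts to a single channel-wise concatenation of two image tensors. A minimal PyTorch sketch is shown below; the per-channel mean/std values are an assumption for illustration (ImageNet statistics), not necessarily the ones used in our experiments.

```python
import torch

def to_six_channels(real, fake, mean, std):
    """Normalize a real image and its translated fake counterpart, then
    stack them along the channel axis: (3,H,W) + (3,H,W) -> (6,H,W)."""
    real = (real - mean[:, None, None]) / std[:, None, None]
    fake = (fake - mean[:, None, None]) / std[:, None, None]
    return torch.cat([real, fake], dim=0)

mean = torch.tensor([0.485, 0.456, 0.406])  # assumed ImageNet statistics
std = torch.tensor([0.229, 0.224, 0.225])
x = to_six_channels(torch.rand(3, 256, 256), torch.rand(3, 256, 256), mean, std)
print(x.shape)  # torch.Size([6, 256, 256])
```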
A. Image-to-image Translation
To realize cross-domain object detection, we would normally have to collect and annotate data in the target domain to train the model. Since it is difficult to acquire annotated data in the target domain, image-to-image translation models provide an option to generate fake data in the target domain.

In our experiment, we employed an unpaired image-to-image translation model, CycleGAN [52]. CycleGAN is an unsupervised image-to-image translation model that only requires images from two different domains (without any image-level or instance-level annotations) for training. Furthermore, being unpaired means that the images from the two domains do not need to be paired, which would be extremely demanding to obtain. Last but not least, the locations and sizes of the objects in the images remain the same after the translation, so any image-level labels and instance-level annotations of the original images can be used directly for the translated images. This property is extraordinarily significant since most CNN models are data-driven and annotations are indispensable to successfully train supervised CNN models (e.g., most object detection models). Unpaired image-to-image translation models such as CycleGAN [52] can translate the images in two directions without changing the key properties of the objects in the images. Thus, annotations such as coordinates and class labels of the objects in the original images can be smoothly reused for the fake translated images. As manually annotating images is significantly expensive, with image-to-image translation the translated images automatically inherit the labels of their original counterparts, which to some extent makes manually annotating new images unnecessary.
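A small sketch of this label-preserving property follows; nn.Identity stands in for a trained CycleGAN generator (in practice, the day-to-night generator loaded from a checkpoint), and the box coordinates are made up for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for a trained CycleGAN generator; in practice this would be the
# day-to-night generator network loaded from a checkpoint.
G_day2night = nn.Identity()

day = torch.rand(1, 3, 256, 256) * 2 - 1   # a normalized day image in [-1, 1]
with torch.no_grad():
    fake_night = G_day2night(day)          # same height and width as the input

# Because the translation preserves object locations and sizes, boxes
# annotated on the day image apply to the fake night image unchanged.
boxes_day = torch.tensor([[52., 80., 190., 160.]])  # (x1, y1, x2, y2) of a car
boxes_fake_night = boxes_day                        # reused as-is
```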
B. CNN Models
In Fig. 1, the CNN model can be any CNN-based object detection model in which the input dimension of the convolutional kernels in the first layer is changed from 3 to 6. In our implementation, we employ Faster R-CNN [37] for detection, with ResNet-101 [16] as the backbone network.

Fig. 2. Several samples of original-day images (1st row) and their corresponding GAN-generated fake-night images (2nd row).
Fig. 3. Several samples of original-night images (1st row) and their corresponding GAN-generated fake-day images (2nd row).

Faster R-CNN is a classic two-stage anchor-based object detector comprised of a Region Proposal Network (RPN) and a detection network. Since it is an anchor-based model, we have to design pre-defined anchor boxes on the feature maps. Typically, 9 anchors with 3 different sizes and 3 different aspect ratios act as the pre-defined anchor boxes at each location of the feature maps. The objective of the RPN is to select, from the pre-defined anchors, region proposals with a high probability of containing objects and to further refine their coordinates. Each pre-defined anchor is associated with a score indicating the probability that the anchor box contains an object. Only the anchor boxes with scores higher than some threshold are selected as region proposals; those proposals are further refined by the RPN and then fed into the detection network.

The purpose of the detection network is to receive the region proposals selected and refined by the RPN, classify each rectangular proposal, and perform bounding box regression to improve the coordinates of the box proposals. Since the region proposals may have various sizes and shapes (more precisely, the number of elements each proposal covers may vary), and the fully connected layers require a fixed input length, an ROI pooling layer is adopted to ensure that the input of each proposal to the detection network has a fixed size. The detection network follows Fast R-CNN [14]: it classifies the object that each region proposal might contain and simultaneously refines the coordinates of the rectangular boxes. The output of the Faster R-CNN network is the class of the object each proposal might include and the coordinates of the bounding box for each refined proposal.
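The fixed-size pooling step is easy to see in isolation. The sketch below uses torchvision's roi_align, a common variant of ROI pooling (the implementation we build on [45] has its own pooling layer), with made-up feature maps and proposals, to show how proposals of different shapes are all mapped to the same 7x7 feature size.

```python
import torch
from torchvision.ops import roi_align

# A fake backbone feature map: batch of 1, 256 channels, 32x32 spatial size
# (e.g., a 256x256 input downsampled by 8); values are random for illustration.
features = torch.randn(1, 256, 32, 32)

# Two proposals of very different sizes, in (x1, y1, x2, y2) image coordinates.
proposals = [torch.tensor([[10., 10., 200., 120.],
                           [30., 40., 60., 220.]])]

# ROI align maps every proposal to the same 7x7 grid, so the detection head
# always receives a fixed-length input regardless of proposal shape.
pooled = roi_align(features, proposals, output_size=(7, 7),
                   spatial_scale=32 / 256)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```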
IV. EXPERIMENTS
In this section, the datasets, the experimental methodology, and the parameter settings are elaborated. We conducted some of the experiments from [2] for comparison.
A. Datasets
We employ the same dataset as [2] in our experiments. The original data are from BDD100K [47], a large-scale, diverse dataset of driving scenes. Since the dataset is extremely large and contains high-resolution images covering various road scenarios and weather conditions (sunny, rainy, foggy, etc.) [2], the authors of [2] only chose clear or partly cloudy day and night images to demonstrate the domain shift from day to night. In addition, all selected images are cropped to 256 × 256 pixels with proper adjustment. A total of 12,000 images remain after processing (6,000 day images and 6,000 night images). The images are then randomly sampled and divided into four sets: train-day, train-night, test-day, and test-night, each containing 3,000 256 × 256 images. We use the train-day and train-night sets to train the CycleGAN model and utilize the trained GAN model to generate fake train-night (from train-day), fake train-day (from train-night), fake test-night (from test-day), and fake test-day (from test-night) sets. We thus have a total of 12,000 real images (3,000 per set) and 12,000 fake images (3,000 per set). We can then concatenate the real images and their corresponding fake images to generate the 6-channel representations that are fed into the Faster R-CNN detector. After choosing and processing the images, the car is the only object class to be detected. Some samples of real images and their corresponding GAN-generated fake counterparts are illustrated in Fig. 2 and Fig. 3.
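For concreteness, a minimal sketch of how such paired 6-channel samples might be assembled is shown below. The directory layout, the filename-based pairing, the normalization statistics, and the load_boxes annotation loader are all hypothetical, not our exact data pipeline.

```python
import os
import torch
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms.functional as F

class SixChannelDataset(Dataset):
    """Pairs each real image with its GAN-translated counterpart.

    The annotations of the real image are reused unchanged, since the
    translation preserves object locations and sizes.
    """
    def __init__(self, real_dir, fake_dir, load_boxes):
        # load_boxes: hypothetical callable returning car boxes for a filename.
        self.real_dir, self.fake_dir = real_dir, fake_dir
        self.names = sorted(os.listdir(real_dir))
        self.load_boxes = load_boxes

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        name = self.names[i]
        real = Image.open(os.path.join(self.real_dir, name)).convert("RGB")
        fake = Image.open(os.path.join(self.fake_dir, name)).convert("RGB")
        # Normalize each image separately, then stack along the channel axis.
        pair = [F.normalize(F.to_tensor(im), mean=[0.485, 0.456, 0.406],
                            std=[0.229, 0.224, 0.225]) for im in (real, fake)]
        x = torch.cat(pair, dim=0)          # shape: (6, 256, 256)
        return x, self.load_boxes(name)     # labels come from the real image
```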
B. Experimental Evaluations
The Faster R-CNN model is implemented in Python [45] with PyTorch 1.0.0, and CycleGAN is implemented in Python [53] with PyTorch 1.4.0. All experiments are executed with CUDA 9.1.85 and cuDNN 7 on a single NVIDIA TITAN Xp GPU with 12 GB of memory.

The metric we employ is mean Average Precision (mAP) from PASCAL VOC [11], the same metric employed in [2]. Since the car is the only object class to be detected, mAP is equivalent to AP on this dataset, as mAP averages the AP over all classes.

For CycleGAN, the parameters are the default values in [53]. For Faster R-CNN, similar to [2], we utilize ResNet-101 [16] pre-trained on ImageNet [9] as our backbone network. We select initial learning rates from 0.001 to 0.00001 and run the experiments separately for the chosen initial learning rates, although we do not use all of them for every experiment, since our experiments demonstrate that the higher the learning rate selected from this range, the better the results. Every 5 epochs, the learning rate decays to 0.1 of its previous value. Training runs for 20 to 30 epochs, but the results indicate that the Faster R-CNN model converges relatively early on this dataset. We record the testing results every 5 training epochs and report the best one for each experiment. The model parameters are the same for the 6-channel and 3-channel experiments, except that for the 6-channel experiments the input dimension of the first-layer kernels of the Faster R-CNN model is 6 instead of 3; we simply duplicate each first-layer kernel of the ResNet-101 backbone along the channel axis to create 6-channel kernels as the initial training parameters. For the 3-channel experiments, we simply use the original ResNet-101 backbone as our initial training parameters.
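A sketch of this initialization and learning-rate schedule is given below; the optimizer settings beyond the learning rate (e.g., momentum) are illustrative assumptions, and the actual training code is based on the implementation in [45].

```python
import torch
import torch.nn as nn
import torchvision
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# Duplicate the pretrained first-layer RGB kernels to initialize the
# 6-channel stem, as described above.
backbone = torchvision.models.resnet101(pretrained=True)
w3 = backbone.conv1.weight.data                    # shape (64, 3, 7, 7)
conv6 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
conv6.weight.data = torch.cat([w3, w3], dim=1)     # shape (64, 6, 7, 7)
backbone.conv1 = conv6

# Learning-rate schedule: decay to 0.1x every 5 epochs (initial learning rate
# between 1e-3 and 1e-5 in our runs; 1e-3 shown here).
optimizer = SGD(backbone.parameters(), lr=1e-3, momentum=0.9)
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)
for epoch in range(20):
    # ... one training epoch over the 6-channel data goes here ...
    scheduler.step()
```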
C. Experimental Results
First, we ran the training and testing of the original 3-channel Faster R-CNN model; the results are presented in Table I. The test set is the fixed test-night data. With different training sets, the detection results on test-night vary. From Table I we can see that, for testing on the test-night set, the model trained on the fake train-night set is much better than the one trained on the original train-day set, which corresponds to the results from [2]. These experimental results indicate that if annotated day images are the only available training data while the test set contains only night images, we could leverage fake night images generated by the image-to-image translation models to train the CNN model. The results are excellent when the model is trained on the train-night set (without domain shift), indicating that domain shift has the most significant influence on the performance of the CNN model in this experiment.

TABLE I
3-CHANNEL DETECTION

Train set                                  mAP
train-day (3,000 images)                   0.777
fake train-night (3,000 images)            0.893
train-night (3,000 images)                 0.933
train-day + train-night (6,000 images)     0.941

Then we conducted the experiments for our proposed 6-channel Faster R-CNN model, shown in Table II. The test data are comprised of test-night images concatenated with the corresponding translated fake test-day images. The training sets in Table II have 6 channels. For instance, train-day in the table indicates train-day images concatenated with the corresponding fake train-night images, and train-day + train-night represents train-day images concatenated with the corresponding fake train-night images plus train-night images concatenated with the corresponding fake train-day images.
TABLE II
6-CHANNEL DETECTION

Train set                                                   mAP
train-day (3,000 6-channel representations)                 0.830
train-night (3,000 6-channel representations)               0.931
train-day + train-night (6,000 6-channel representations)   0.938
From Table I and Table II, it is noticeable that even though the model trained on train-day images concatenated with fake train-night images (6-channel) achieves a better result (AP 0.830) than the one trained only on train-day (3-channel, AP 0.777), it is worse than the model trained only on fake train-night (3-channel, AP 0.893).

To examine whether the 6-channel approach can improve the detection results when the training set and testing set do not have domain shift, we also performed the experiment that trains the model on the train-night set (3-channel) and tests it on the test-night set. From Table I, the average precision is 0.933, which is quite high since there is no domain shift between the training data and testing data. Accordingly, we ran the corresponding 6-channel experiment, which trains on the train-night set concatenated with the fake train-day set and tests on test-night images concatenated with fake test-day images. From Table II, the average precision of this 6-channel model is almost the same as that of its corresponding 3-channel model.

We also increased the size of the training data by training the model on the train-day set plus the train-night set and testing it on the test-night data. From Table I and Table II, the 6-channel model again performs similarly to its 3-channel counterpart.

More experimental results, all from the original 3-channel models, are shown in Table III. To remove the effect of domain shift, the training set and the testing set here share the same domain (all day images or all night images). From Table III, it is obvious that the "quality" shift influences the performance of the models. For instance, the model trained on the original train-day (or train-night) set performs better on the original test-day (or test-night) set than on the GAN-generated fake day (or night) images. Similarly, the model trained on the GAN-generated fake train-day (or fake train-night) set performs better on the GAN-generated fake test-day (or fake test-night) set than on the original test-day (or test-night) set.
TABLE III
3-CHANNEL EXTRA EXPERIMENTS

Train set          Test set          mAP
train-day          test-day          0.945
train-day          fake test-day     0.789
fake train-day     fake test-day     0.914
fake train-day     test-day          0.903
train-night        test-night        0.932
train-night        fake test-night   0.859
fake train-night   fake test-night   0.924
fake train-night   test-night        0.868
V. CONCLUSION
This study has evaluated a 6-channel approach to addressing the domain shift issue by incorporating fake images generated by image-to-image translation. However, we have not achieved the expected results. One possible reason is that the quality of the generated images is inferior compared to the original images, especially the fake day images generated from night scenes, as illustrated in Fig. 2 and Fig. 3. If we merely concatenate the original high-quality images with their inferior counterparts, the model may treat the low-quality fake image channels as a kind of "noise", and thus the model can hardly learn more useful information from the concatenated 6-channel representations. Another possible reason is that the domain shift may still exist within the combined 6-channel representations, which prevents the model from extracting useful information from the concatenated representations. Moreover, the dataset we used in the experiments only has limited samples, which may be insufficient to train the model. We hope the idea of augmented data representation can inspire further investigations and applications.

REFERENCES
[1] A. Alotaibi, "Deep generative adversarial networks for image-to-image translation: A review," Symmetry, vol. 12, no. 10, p. 1705, 2020.
[2] V. F. Arruda, T. M. Paixão, R. F. Berriel, A. F. De Souza, C. Badue, N. Sebe, and T. Oliveira-Santos, "Cross-domain car detection using unsupervised image-to-image translation: From day to night," in International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[3] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "Yolov4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[4] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection," in European Conference on Computer Vision. Springer, 2016, pp. 354–370.
[5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," arXiv preprint arXiv:2005.12872, 2020.
[6] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, "Domain adaptive faster r-cnn for object detection in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3339–3348.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[8] J. Dai, Y. Li, K. He, and J. Sun, "R-fcn: Object detection via region-based fully convolutional networks," arXiv preprint arXiv:1605.06409, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[10] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "Centernet: Keypoint triplets for object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569–6578.
[11] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
[12] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in International Conference on Machine Learning. PMLR, 2015, pp. 1180–1189.
[13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The kitti dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[14] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[17] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[18] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 172–189.
[19] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa, "Cross-domain weakly-supervised object detection through progressive domain adaptation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5001–5009.
[20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[21] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan, "Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?" arXiv preprint arXiv:1610.01983, 2016.
[22] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," arXiv preprint arXiv:1703.05192, 2017.
[23] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.
[24] H. Law and J. Deng, "Cornernet: Detecting objects as paired keypoints," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
[25] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, "Diverse image-to-image translation via disentangled representations," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 35–51.
[26] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[27] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[29] M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Advances in Neural Information Processing Systems, 2017, pp. 700–708.
[30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "Ssd: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[31] W. Ma, K. Li, and G. Wang, "Location-aware box reasoning for anchor-based single-shot object detection," IEEE Access, vol. 8, pp. 129300–129309, 2020.
[32] W. Ma, Y. Wu, F. Cen, and G. Wang, "Mdfn: Multi-scale deep feature learning network for object detection," Pattern Recognition, vol. 100, p. 107149, 2020.
[33] X. Mo, K. Tao, Q. Wang, and G. Wang, "An efficient approach for polyps detection in endoscopic videos based on faster r-cnn," IEEE, 2018, pp. 3929–3934.
[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[35] J. Redmon and A. Farhadi, "Yolo9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[36] ——, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[37] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[38] C. Sakaridis, D. Dai, and L. Van Gool, "Semantic foggy scene understanding with synthetic data," International Journal of Computer Vision, vol. 126, no. 9, pp. 973–992, 2018.
[39] Z. Tian, C. Shen, H. Chen, and T. He, "Fcos: Fully convolutional one-stage object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9627–9636.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[41] X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, "Towards universal object detection by domain attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7289–7298.
[42] Y. Wu, Z. Zhang, and G. Wang, "Unsupervised deep feature transfer for low resolution image classification," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[43] W. Xu, S. Keshmiri, and G. Wang, "Stacked wasserstein autoencoder," Neurocomputing, vol. 363, pp. 195–204, 2019.
[44] W. Xu, K. Shawn, and G. Wang, "Toward learning a unified many-to-many mapping for diverse image translation," Pattern Recognition, vol. 93, pp. 570–580, 2019.
[45] J. Yang, J. Lu, D. Batra, and D. Parikh, "A faster pytorch implementation of faster r-cnn," https://github.com/jwyang/faster-rcnn.pytorch, 2017.
[46] Z. Yi, H. Zhang, P. Tan, and M. Gong, "Dualgan: Unsupervised dual learning for image-to-image translation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2849–2857.
[47] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "Bdd100k: A diverse driving video database with scalable annotation tooling," arXiv preprint arXiv:1805.04687, vol. 2, no. 5, p. 6, 2018.
[48] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4203–4212.
[49] X. Zhang, T. Zhang, Y. Yang, Z. Wang, and G. Wang, "Real-time golf ball detection and tracking based on convolutional neural networks," IEEE, 2020, pp. 2808–2813.
[50] X. Zhou, J. Zhuo, and P. Krahenbuhl, "Bottom-up object detection by grouping extreme and center points," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 850–859.
[51] C. Zhu, Y. He, and M. Savvides, "Feature selective anchor-free module for single-shot object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 840–849.
[52] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[53] J.-Y. Zhu, T. Park, and T. Wang, "Cyclegan and pix2pix in pytorch," https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix, 2020.
[54] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, "Toward multimodal image-to-image translation," in Advances in Neural Information Processing Systems, 2017, pp. 465–476.