Addressing Visual Search in Open and Closed Set Settings

Nathan Drenkow, Philippe Burlina, Neil Fendley, Onyekachi Odoemene, Jared Markowitz

The Johns Hopkins University Applied Physics Laboratory
Abstract
Searching for small objects in large images is currently challenging for deep learning systems, but it is a task with numerous applications including remote sensing and medical imaging. Thorough scanning of very large images is computationally expensive, particularly at resolutions sufficient to capture small objects. The smaller an object of interest, the more likely it is to be obscured by clutter or otherwise deemed insignificant. We examine these issues in the context of two complementary problems: closed-set object detection and open-set target search. First, we present a method for predicting pixel-level objectness from a low resolution gist image, which we then use to select regions for subsequent evaluation at high resolution. This approach has the benefit of not being fixed to a predetermined grid, allowing fewer costly high-resolution glimpses than existing methods. Second, we propose a novel strategy for open-set visual search that seeks to find all objects in an image of the same class as a given target reference image. We interpret both detection problems through a probabilistic, Bayesian lens, whereby the objectness maps produced by our method serve as priors in a maximum-a-posteriori approach to the detection step. We evaluate the end-to-end performance of both the combination of our patch selection strategy with this target search approach and the combination of our patch selection strategy with standard object detection methods. Both our patch selection and target search approaches are seen to significantly outperform baseline strategies.
1. Introduction
Artificial intelligence (AI) approaches relying on deep learning (DL), primarily using deep convolutional neural networks (CNNs), have recently shown great success at numerous tasks. In problems such as image classification [14, 12], object detection [21, 19], and image segmentation [16, 22], and in application areas such as medical diagnostics [5, 16], AI approaches can meet or exceed human capabilities. Among other things, current AI/DL research addresses issues such as open set recognition [24, 9] (one of the foci of this study), privacy [26], adversarial attacks [6], low-shot learning [18, 4], and AI bias [3].

Figure 1. Conventional approaches use image tiling or sliding windows to ensure coverage while keeping tile dimensions small enough to be processed by standard object detectors. We propose to use predicted objectness to achieve the same end while minimizing the total number of high-resolution windows or glimpses required.

This study is specifically focused on the task of using CNNs for detecting objects, where much progress has been made through algorithms such as YOLO or Fast(er)-RCNN [21, 19, 20, 28]. However, most object detection work has thus far relied on the assumption that the target objects occupy a significant fraction of the search area. We consider instead the situation where objects may be several orders of magnitude smaller than the image size (e.g., images with thousands of pixels per side and relevant objects spanning only tens of pixels). This occurs frequently in remote sensing applications, notably in satellite imagery (e.g., vehicles in a parking lot) and microscopy images (e.g., synapses in electron microscopy imaging of brain tissue [27]). Most machine vision techniques operate on images on the order of a few hundred pixels per side, such as the images of ImageNet [7]. Processing very high resolution (vHR) images requires additional computation, time, and money, particularly in streaming applications. To be useful, approaches to this problem must efficiently manage trade-offs among computational cost, memory, and performance in an application-specific manner.

Thoroughly searching such data sources for objects at varying scales motivates the development of unconventional approaches. Standard techniques relying on convolutional neural networks involve processing sliding windows of full images.
The computational cost of this process increases quadratically with the size of the image, incurring large memory and computational footprints that eventually become prohibitive. In practice, the computational budget may be fixed, leading to choices about how to prioritize window selection. Further, a given architecture may struggle to reason across sufficiently diverse scales, leading to potential neglect of target candidates. We thus have two goals for this work: (1) to achieve high performance for object search and detection in vHR images and (2) to develop methods that scale according to data and computational constraints.

To address these issues, we sought an approach to more efficiently "look" for objects belonging to classes of interest (Figure 1). Recent work [30] has tackled aspects of this problem using deep reinforcement learning (DRL). We instead address it through an approach that identifies regions of high potential objectness. We use a probabilistic interpretation of object detection in vHR images whereby predicted objectness maps act as priors for detection algorithms, allowing us to perform search in a maximum-a-posteriori setting. In this work, we examine two different detection scenarios: (1) closed-set object detection, whereby we search efficiently for instances of a fixed number of pre-identified object classes, and (2) open-set target-guided search, where the algorithm must find instances of a target class defined by only a single image.
2. Related Work
Numerous works have addressed the problem of object detection in visual data through the use of convolutional neural networks. R-CNN [11] and Fast-RCNN [10] rely on selective search to identify proposal regions, while Faster-RCNN [21] jointly identifies proposal regions and their classes. The YOLO family of algorithms [19, 20, 2], on the other hand, passes the entire image to a detection network, producing bounding boxes and object probabilities in a single pass and leading to superior speed. More recently, EfficientDet [28] used several algorithmic innovations to provide state-of-the-art detection performance in a quantifiably more efficient manner. However, like prior methods, EfficientDet struggles to detect small objects in large, cluttered scenes. While all of these methods perform well on natural imagery, they are not immediately amenable to the vHR images we consider here: such images would need to be heavily downsampled, or at a minimum tiled, just to enable processing through the standard architectures on conventional GPU hardware.

Early work in applying object detectors based on deep learning to overhead imagery [23, 8, 31, 25, 17, 15, 29] has focused on challenges including very large changes in scale, the need for rotation invariance, and limited amounts of training data. While progress has been made in addressing these challenges, the issue of image resolution has largely been left to naive windowing/tiling or multi-scale approaches. These methods show improved detection results on large images but do not address efficiency concerns.

Recent work [30] has allowed for targeted object detection in vHR images using DRL. This approach uses a two-stage selection process on fixed grids of potential search regions as a way to address efficiency challenges. Each high-resolution (HR) grid tile is either downsampled or processed natively by a conventional detection network, with a learned policy being used to make the low- vs. high-resolution determination. The DRL agent is trained to choose which regions of an image to process at low resolution and which regions to process at high resolution in order to best balance efficiency with detection performance.

Here we provide an alternative that allows flexible sampling of HR windows (also referred to as glimpses) rather than relying on a fixed grid. Furthermore, we develop an approach for estimating objectness from low-resolution (LR) gist images that then guides our glimpse-sampling approach. Finally, we demonstrate how this approach can be used for open-set search in a MAP framework.
3. Methods
We discuss complementary approaches to object detection in vHR overhead images. In particular, we address object detection in closed- and open-set scenarios, using estimates of objectness from LR gists to guide the search and detection processes.
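Concretely, the objectness training signal developed in this section reduces to a per-pixel binary cross-entropy between a downsampled binary mask and a predicted map, plus a simple scale calculation. A minimal pure-Python sketch follows; the function name, list-of-lists representation, and toy sizes are illustrative assumptions, and a real implementation would use a vectorized DL framework:

```python
import math

def objectness_bce(y_true, y_pred, eps=1e-7):
    """Per-pixel binary cross-entropy between a binary gist mask S_g
    (y_true) and a predicted objectness map pi (y_pred), as in Eq. (2).
    Both arguments are 2-D lists of floats in [0, 1]."""
    loss = 0.0
    for row_t, row_p in zip(y_true, y_pred):
        for y, p in zip(row_t, row_p):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss

# Scale factor (Eq. 1) and the glimpse footprint in gist space for the
# paper's example sizes: a 4000x4000 vHR image resized to a 128x128 gist,
# with 512x512 HR glimpses.
alpha = 128 / 4000
d_glimpse_gist = math.ceil(alpha * 512)  # -> 17 pixels in gist space
```

A perfect prediction drives the loss to (near) zero, while confident wrong pixels dominate the sum, which is what steers the U-Net toward useful density maps.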
We first develop an approach that allows our model to examine the full scene at low resolution (and with low computational cost), producing a saliency map to guide subsequent high-resolution object detection. Specifically, we train a deep neural network (DNN) to produce a density map of objectness; that is, a map that encodes the likelihood of regions containing objects from the classes we wish to detect. In practice, given a high-resolution ground-truth class-level semantic segmentation, $S_c$, we generate a binary segmentation $S_b$ by converting the class-level labels to binary labels that represent the presence or absence of an object.

To train the prediction model, we generate LR gist images $I_g$ from the original vHR image $I_{vHR}$ (e.g., we resize a 4000x4000 image to 128x128) and similarly scale $S_b$ to get a gist mask $S_g$ of fixed dimensions, $(w_{gist}, h_{gist})$. For convenience and following common practice, we assume images are square (i.e., $w_{gist} = h_{gist} = d_{gist}$). The scale factor, $\alpha$, is defined as

$\alpha = d_{gist} / d_{vHR}$.   (1)

As a result of the downsampling, we expect the gist images to lose some class-level information and potentially even reduce small objects beyond recognition (e.g., to one pixel). However, our intent is to produce density maps that capture object likelihood rather than a true segmentation. We find that the gist images still preserve sufficient context to support those predictions (e.g., textures, co-occurring larger objects, etc.).

Using the gist images, we train a U-Net [22] to produce objectness density maps $\pi$. We define our training loss to be the binary cross-entropy between the true object mask and the prediction,

$L_\pi = -\sum_{i,j \in S_g} \left[ y_{ij} \log(\pi_{ij}) + (1 - y_{ij}) \log(1 - \pi_{ij}) \right]$.   (2)

Here $y_{ij}$ is the true objectness value at pixel $(i, j)$ and $\pi_{ij}$ is the objectness prediction at that same location.

In case ground-truth segmentation is unavailable, an alternative is to convert bounding box annotations to Gaussian densities of the same shape, normalized to have a peak of 1 at the center. Supplemental experiments demonstrated this to be an effective alternative when high-quality ground truth is sparse.

Given the objectness maps as a prior, we seek to select a set of high-resolution regions to glimpse over. While there are many strategies for performing the selection, our general process is as follows. We first define the appropriate glimpse image dimensions, as determined by the downstream object detector. Assuming square input images as before, we use $d_{glimpse}$ to denote the final glimpse resolution. From this definition, we can determine the corresponding size of a glimpse in the gist image via $d'_{glimpse} = \mathrm{ceil}(\alpha \cdot d_{glimpse})$.

Given the glimpse size relative to the gist image, we seek to define a policy for sampling glimpses that maximizes the detection of objects of interest. We initially focus on rule-based policies that are deterministic and interpretable. Our simple yet effective policy is described in Algorithm 1. The method iteratively samples glimpses that maximize the total available objectness. This strategy allows glimpses to overlap each other, but can easily be modified by increasing or decreasing the $\beta$ term (for instance, $\beta = -(d'_{glimpse})^2$ prevents any glimpse overlap).

Algorithm 1:
Glimpse Selection Policy
Result: Selection of a set of glimpses
Input:
  π : objectness map prior of dimension d_gist
  d_glimpse : glimpse dimension in HR image-space
  α : gist scaling factor
  n_glimpse : maximum number of glimpses
  β : objectness penalty term [default = 0]
Output:
  G : set of selected glimpse locations

  Compute d'_glimpse = ceil(α · d_glimpse);
  Compute σ = d'_glimpse;
  Generate Gaussian kernel k with width σ and dimension d'_glimpse;
  Compute π' = conv(π, k);
  G = {};
  for i = 1 : n_glimpse do
      Generate S_π' according to (3);
      Compute p_{x,y} according to (4);
      Add p_{x,y} to G;
      Reset objectness at the glimpse location: π'[p_x : p_x + d'_glimpse, p_y : p_y + d'_glimpse] = β;
  end
  return G

At each step in the loop in Algorithm 1, we search for where to glimpse next so that the sampled glimpse maximizes the total objectness contained within it. To speed up this search, we first compute an integral image of the current map π:

$S_\pi(i, j) = \sum_{i' \le i,\, j' \le j} \pi(i', j') \quad \forall\, i, j$   (3)

From this table, we can easily search for the glimpse with maximum objectness by simply summing over shifted (according to $d'_{glimpse}$) versions of $S_\pi$ and finding the location of the maximum. Using Python-style notation to illustrate indexing $S_\pi$ for the four shift operations, the glimpse with maximum objectness over all search locations $(x, y)$ can be found according to:

$p_{x,y} = \arg\max_{x,y} \left( S_\pi[:-d'_{glimpse},\, :-d'_{glimpse}] + S_\pi[d'_{glimpse}:,\, d'_{glimpse}:] - S_\pi[:-d'_{glimpse},\, d'_{glimpse}:] - S_\pi[d'_{glimpse}:,\, :-d'_{glimpse}] \right)$   (4)

We consider now the task of target-based, open-set search. In this setting the goal is to search the image for

Figure 2. (left) Ground truth instance annotation for all DOTA classes. (middle) Low-resolution gist objectness map produced by the U-Net trained on DOTA6. (right) Original image with glimpse (cyan boxes) selections and detections (red dots).
For a fixed budget of four HR windows, this example illustrates the potential of objectness-guided search to examine the most likely occupied regions of interest first. Note that images are for visualization purposes only and are not shown to scale.

tiles (sub-images) that support the presence of a desired target object, while relying on only a single example image of that target (which we denote here as $o$).

Consider a set of candidate search locations obtained by tiling the original image into a set of sub-image tiles, and denote their coordinates as $(i, j)$. To achieve the goal of optimal open-set search, we use as our criterion the likelihood $l_o(i, j)$ that the specific target object is found at that location. We estimate this likelihood by comparing the network embedding representations $f$ of the exemplar object image $o$ and the tile $T(i, j)$:

$l_o(i, j) = \cos(f(T(i, j)), f(o))$.   (5)

We use the second-to-last layer of ResNet50 as the embedding function $f(\cdot)$ in computing this deep similarity $l_o(i, j)$.

We further consider a refined approach that uses a maximum a posteriori (MAP) criterion, wherein we interpret the objectness probability $\pi_{i,j}$ (in Eq. (2)), integrated over a specific tile (at coordinates $(i, j)$), as a prior that any object of interest is present in that tile. The resulting a-posteriori probability $p_o(i, j)$ that the target object is present in the tile is then:

$p_o(i, j) = l_o(i, j)\, \pi_{i,j}$.   (6)

We use these metrics to obtain a rank ordering of tiles to visit by deep similarity. Our search policy is simply to visit tiles in decreasing order of either $l_o(i, j)$ or $p_o(i, j)$. We call global searches ordered by likelihood $l_o(i, j)$ G-ML-MSTR ("global search / searching for most similar to reference") and global searches ordered by a-posteriori probability $p_o(i, j)$ G-MAP-MSTR.
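Returning briefly to the closed-set side, the greedy selection at the heart of Algorithm 1 (integral image per Eq. (3), shifted box sums per Eq. (4), reset to β) can be sketched in pure Python. The Gaussian smoothing of π is omitted for brevity, and the helper names and list-of-lists map representation are our own illustrative assumptions:

```python
def integral_image(pi):
    """Summed-area table of Eq. (3): S[i][j] = sum of pi over i' <= i, j' <= j."""
    h, w = len(pi), len(pi[0])
    S = [[0.0] * w for _ in range(h)]
    for i in range(h):
        run = 0.0
        for j in range(w):
            run += pi[i][j]
            S[i][j] = run + (S[i - 1][j] if i > 0 else 0.0)
    return S

def box_sum(S, x, y, d):
    """Total objectness inside the d x d window with top-left corner (x, y),
    via the four-corner lookup that Eq. (4) expresses with array shifts."""
    x2, y2 = x + d - 1, y + d - 1
    total = S[x2][y2]
    if x > 0:
        total -= S[x - 1][y2]
    if y > 0:
        total -= S[x2][y - 1]
    if x > 0 and y > 0:
        total += S[x - 1][y - 1]
    return total

def select_glimpses(pi, d, n_glimpse, beta=0.0):
    """Greedy glimpse selection in the spirit of Algorithm 1: repeatedly take
    the d x d window containing maximum total objectness, then reset that
    region of the map to beta so later glimpses are pushed elsewhere."""
    pi = [row[:] for row in pi]  # work on a copy of the objectness map
    h, w = len(pi), len(pi[0])
    glimpses = []
    for _ in range(n_glimpse):
        S = integral_image(pi)
        candidates = [(x, y) for x in range(h - d + 1) for y in range(w - d + 1)]
        x, y = max(candidates, key=lambda xy: box_sum(S, xy[0], xy[1], d))
        glimpses.append((x, y))
        for i in range(x, x + d):
            for j in range(y, y + d):
                pi[i][j] = beta
    return glimpses
```

On a toy 4x4 map with a single hot 2x2 block, the first selected glimpse lands on that block; the integral image makes each window score an O(1) lookup rather than an O(d²) sum.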
4. Experiments
To test our objectness-based approach, we run an evaluation on the DOTA/iSAID overhead dataset [33, 32], which consists of 2806 overhead images ranging in size from 800x800 to 4000x4000. In order to test the ability of our method to find objects that are small relative to the size of the image, we narrow our evaluation to six object classes: small vehicle, large vehicle, plane, helicopter, ship, and storage tank (henceforth referred to as DOTA6).

We train all core models on the DOTA training split and evaluate on the images of the validation split. Of the 458 images in the validation split, we test on only those containing at least one instance of the objects in DOTA6.

Figure 4 shows performance averaged across all DOTA6 classes. Note that while the oracle detector represents best-case performance, it is still unable to achieve perfect precision, since there are cases where sampled glimpses contain partial objects that fail to meet the IoU detection threshold.
Figure 3. Precision, recall, and F1 score by object class in DOTA6 vs. the number of HR glimpses evaluated. Results are separated by detection method (YOLOv3 vs. best-case oracle). The best case is that each metric reaches 1.0 in the fewest number of glimpses (denoted as n_hr).

Figure 4. Precision, recall, and F1 score averaged over object classes in DOTA6 vs. the number of HR glimpses evaluated. Results are presented using the oracle detector. The best case is that each metric reaches 1.0 in the fewest number of glimpses (denoted as n_hr).

We use a U-Net architecture based on four downsampling and four upsampling blocks, where each block down-/up-samples the input feature map by a factor of two and then passes the result through two blocks of {convolution, batch normalization, ReLU}. Inputs to the U-Net model also pass through a similar block before undergoing downsampling. At each downsampling block, the number of channels is doubled, starting at 64 and ending at 512. Upsampling blocks concatenate the upsampled feature maps with the maps from the skip connection (achieving twice the original number of channels at the corresponding level in the downsampling path) and produce feature maps ranging from 512 channels at the lowest point of the U-Net down to 64 channels at the highest point. Our U-Net accepts inputs of dimensions 128x128x3 and returns a 1-channel objectness prediction of dimension 128x128.

Segmentation masks for the DOTA6 labels were converted to a binary map, and our U-Net was trained via backpropagation using ADAM [13] optimization with a standard binary cross-entropy loss (Eq. (2)) between the objectness prediction and the ground-truth mask.

In this work, we use a trained YOLOv3 detector operating on 512x512 tiles. Because we do not make any algorithmic modifications to the HR-detector architecture itself, the quality of the detection results hinges on our algorithm's ability to select HR glimpses to evaluate.
As such, our baselines were established to measure the quality of the glimpse selection. We implemented several policies to sample glimpses either from a fixed grid or flexibly (allowing possible overlap of glimpses). Baselines are described in Table 1.
We evaluate our method using two approaches: (1) a trained, state-of-the-art object detection model (see 4.2.2) and (2) an oracle detector that correctly detects anything
Policy       Description
grid         Glimpses spaced uniformly, with possible overlap for large n_glimpse
grid fixed   Non-overlapping tiles ordered randomly
random       Overlapping glimpses whose coordinates are sampled uniformly
entropy      Glimpses sampled according to entropy (highest to lowest)
unet         Objectness-guided search according to Algorithm 1
unet fixed   Objectness-guided (highest to lowest) search of non-overlapping tiles

Table 1. Glimpse selection policies.

within a sampled high-resolution glimpse. In so doing we decouple the effects of the glimpsing and object detection strategies.

Additionally, in order to address computational efficiency, we evaluate detection performance considering cases with a fixed HR glimpse budget. Results are shown in Figure 3. We determine true/false detections following the standard practice of using an Intersection-over-Union (IoU) threshold of 0.5 between predicted and true bounding boxes.
We again use DOTA6 to evaluate our open-set search strategies against the following baselines: a naive global sliding window, and local searches in predefined regions maximizing similarity to the target tile, similarity to the initial location, and similarity to the current tile. The local search is motivated by the observation that objects of interest tend to cluster in specific regions (e.g., cars clustered in a parking lot).

The sequences of search trajectories that these approaches generate are compared, for illustrative purposes, in the lower panels of Figures 6, 8, and 7. In the top panes of these figures, we plot the recall value as a function of the number of search steps. Here we define recall in terms of the ground-truth locations, assuming that the algorithm recognizes an object when it lands upon it (i.e., oracle detection).

An overall comparison of the different strategies over a set of N=11 images is provided in Figure 5. It can be seen that the best performance is achieved by the global ML- and MAP-guided approaches, with a clear benefit observed when the prior π(i, j) is used in the selection of the next tile.
Figure 5. Performance (recall) comparison aggregated over all images for all proposed methods against all baselines described: G-ML-MSTR (most similar to reference), G-MAP-MSTR (prior-weighted most similar to reference), G-SW (sliding window), L-R (random in local neighborhood), L-ML-MSTR (most similar to reference, in neighborhood), and L-MSTC (most similar to current, in neighborhood). The y-axis is recall; the x-axis is the normalized number of looks (where 1 indicates that all tiles in the image have been searched).
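The orderings compared in Figure 5 reduce to a score-and-sort over tiles. A pure-Python sketch follows; in the paper $f(\cdot)$ is a ResNet50 penultimate-layer embedding, but the toy vectors, helper names, and dict layout here are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (Eq. 5)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def rank_tiles(tile_embeddings, ref_embedding, tile_priors=None):
    """Visit order over tiles. Without priors this is G-ML-MSTR (rank by
    likelihood l_o, Eq. 5); with per-tile objectness priors it becomes
    G-MAP-MSTR (rank by posterior p_o = l_o * pi, Eq. 6).
    tile_embeddings: dict mapping tile coordinates (i, j) to an embedding."""
    scores = {}
    for loc, emb in tile_embeddings.items():
        l_o = cosine(emb, ref_embedding)
        scores[loc] = l_o * (tile_priors[loc] if tile_priors is not None else 1.0)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a strong objectness prior can promote a slightly less similar tile ahead of a more similar one sitting in an empty region, which is the mechanism behind G-MAP-MSTR's gain over G-ML-MSTR.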
5. Discussion
Several observations can be made about our approaches and results.

In the closed-set paradigm, we find that conventional object detectors, such as YOLOv3, exhibit poor performance in the overhead domain. This is likely due to the differences in object scale and density in overhead imagery compared to the ground-level, natural images for which they were developed. The situation is further complicated by the inherent variability of the DOTA dataset, which was gathered from a variety of sensors and under a variety of imaging conditions. DOTA therefore contains significant variation in both image resolution and Ground Sample Distance (GSD).

In both the closed-set object detection and open-set target-guided search tasks, our objectness maps provide a strong prior for improving the efficiency of the search. The objectness prior improves recall for closed-set detection, particularly in the first few glimpses. This attribute is important given typical computational budgets and because of the absence of other detection methods for vHR overhead imagery. In open-set target-guided search, we see that global searches using similarity to the reference representation solidly outperform the baselines.

Figure 6. (top) Performance comparison of MSTR-ML with other baseline approaches for image P1416: recall vs. normalized number of looks. (bottom) Illustration of the search trajectory using MSTR-ML.
A noticeable gain is also seen when including the objectness prior in that approach.

While we made considerable improvements over baseline methods, closed-set object detection and open-set search in overhead imagery remain areas ripe for future research. Future work may include extending the current methods in both the closed- and open-set settings by combining objectness with reinforcement learning [30] and, when carrying out open-set search in autonomy and robotics applications, using Bayesian filtering [1] to improve the smoothness of search trajectories.
6. Conclusion
This study examines both closed-set object detection and open-set target search. It proposes a method for predicting pixel-level objectness from low-resolution gist images, which is used subsequently to select regions for evaluation at high resolution and, in a Bayesian approach to open-set visual search, to help find all objects in an image of the same type as a given target reference. This work shows the benefits of the objectness-guided approach for improving the efficiency of vHR image processing, both for selecting HR glimpses and for incorporation as a prior in open-set target search. Both approaches are shown to improve performance when compared to baseline methods.

Figure 7. (top) Performance comparison of MSTR-ML with other baseline approaches for image P1419: recall vs. normalized number of looks. (bottom) Illustration of the search trajectory using MSTR-ML.

Figure 8. (top) Performance comparison of MSTR-ML with other baseline approaches for image P1418: recall vs. normalized number of looks. (bottom) Illustration of the search trajectory using MSTR-ML.

References

[1] Amit Banerjee and Philippe Burlina. Efficient particle filtering via sparse kernel density estimation. IEEE Transactions on Image Processing, 19(9):2480–2490, 2010.
[2] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934, Apr. 2020.
[3] Philippe Burlina, Neil Joshi, William Paul, Katia D. Pacheco, and Neil M. Bressler. Addressing artificial intelligence bias in retinal disease diagnostics. arXiv preprint arXiv:2004.13515, 2020.
[4] Philippe Burlina, William Paul, Philip Mathew, Neil Joshi, Katia D. Pacheco, and Neil M. Bressler. Low-shot deep learning of diabetic retinopathy with potential applications to address artificial intelligence bias in retinal diagnostics and rare ophthalmic diseases. JAMA Ophthalmology, 138(10):1070–1077, 2020.
[5] Philippe M. Burlina, Neil Joshi, Katia D. Pacheco, T. Y. Alvin Liu, and Neil M. Bressler. Assessment of deep generative models for high-resolution synthetic retinal image generation of age-related macular degeneration. JAMA Ophthalmology, 137(3):258–264, 2019.
[6] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[8] Adam Van Etten. You only look twice: Rapid multi-scale object detection in satellite imagery. CoRR, abs/1805.09512, 2018.
[9] Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[10] R. Girshick. Fast R-CNN. pages 1440–1448, 2015.
[11] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524, 2014.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[15] W. Liao, X. Chen, J. Yang, S. Roth, M. Goesele, M. Y. Yang, and B. Rosenhahn. LR-CNN: Local-aware region CNN for vehicle detection in aerial imagery. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, V-2-2020:381–388, Aug. 2020.
[16] Mike Pekala, Neil Joshi, T. Y. Alvin Liu, Neil M. Bressler, D. Cabrera DeBuc, and Philippe Burlina. Deep learning based retinal OCT segmentation. Computers in Biology and Medicine, 114:103445, 2019.
[17] Katie Rainey, Shibin Parameswaran, Josh Harguess, and John Stastny. Vessel classification in overhead satellite imagery using learned dictionaries. In Andrew G. Tescher, editor, Applications of Digital Image Processing XXXV, volume 8499, pages 741–752. International Society for Optics and Photonics, SPIE, 2012.
[18] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
[19] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. pages 779–788, 2015.
[20] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018.
[21] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149, 2015.
[22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[23] W. Sakla, G. Konjevod, and T. N. Mundhenk. Deep multi-modal vehicle detection in aerial ISR imagery. pages 916–923, 2017.
[24] Walter J. Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E. Boult. Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1757–1772, 2012.
[25] Jacob Shermeyer and Adam Van Etten. The effects of super-resolution on object detection performance in satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[26] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. pages 3–18. IEEE, 2017.
[27] Shin-ya Takemura, C. Shan Xu, Zhiyuan Lu, Patricia K. Rivlin, Toufiq Parag, Donald J. Olbris, Stephen Plaza, Ting Zhao, William T. Katz, Lowell Umayam, et al. Synaptic circuits and their variations within different columns in the visual system of Drosophila. Proceedings of the National Academy of Sciences, 112(44):13711–13716, 2015.
[28] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[29] Tianyu Tang, Shilin Zhou, Zhipeng Deng, Huanxin Zou, and Lin Lei. Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors, 17(2):336, 2017.
[30] Burak Uzkent, Christopher Yeh, and Stefano Ermon. Efficient object detection in large images using deep reinforcement learning. pages 1813–1822, 2020.
[31] A. Van Etten. Satellite imagery multiscale rapid detection with windowed networks. pages 735–743, 2019.
[32] Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. iSAID: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 28–37, 2019.
[33] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. DOTA: A large-scale dataset for object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.