The Multi-Temporal Urban Development SpaceNet Dataset

Adam Van Etten, Daniel Hogan, Jesus Martinez-Manso, Jacob Shermeyer, Nicholas Weir†, Ryan Lewis†

In-Q-Tel CosmiQ Works, [avanetten, dhogan]@iqt.org; Planet, [email protected]; Capella Space, [email protected]; Amazon, [weirnich, rstlewis]@amazon.com
Abstract
Satellite imagery analytics have numerous human development and disaster response applications, particularly when time series methods are involved. For example, quantifying population statistics is fundamental to 67 of the 231 United Nations Sustainable Development Goals Indicators, but the World Bank estimates that over 100 countries currently lack effective Civil Registration systems. To help address this deficit and develop novel computer vision methods for time series data, we present the Multi-Temporal Urban Development SpaceNet (MUDS, also known as SpaceNet 7) dataset. This open source dataset consists of medium resolution (4.0 m) satellite imagery mosaics, which includes 24 images (one per month) covering >100 unique geographies, and comprises >40,000 km² of imagery and exhaustive polygon labels of building footprints therein, totaling over 11M individual annotations. Each building is assigned a unique identifier (i.e., address), which permits tracking of individual objects over time. Label fidelity exceeds image resolution; this "omniscient labeling" is a unique feature of the dataset, and enables surprisingly precise algorithmic models to be crafted. We demonstrate methods to track building footprint construction (or demolition) over time, thereby directly assessing urbanization. Performance is measured with the newly developed SpaceNet Change and Object Tracking (SCOT) metric, which quantifies both object tracking as well as change detection. We demonstrate that despite the moderate resolution of the data, we are able to track individual building identifiers over time. This task has broad implications for disaster preparedness, the environment, infrastructure development, and epidemic prevention.
1. Introduction
Time series analysis of satellite imagery poses an interesting computer vision challenge, with many human development applications. We aim to advance this field through the release of a large dataset aimed at enabling new methods in this domain. Beyond its relevance for disaster response, disease preparedness, and environmental monitoring, time series analysis of satellite imagery poses unique technical challenges often unaddressed by existing methods. The MUDS dataset (also known as SpaceNet 7) consists of imagery and precise building footprint labels over dynamic areas for two dozen months, with each building assigned a unique identifier (see Section 3 for further details). In the algorithmic portion of this paper (Section 5), we focus on tracking building footprints to monitor construction and demolition in satellite imagery time series. We aim to identify all of the buildings in each image of the time series and assign identifiers to track the buildings over time.

Timely, high-fidelity foundational maps are critical to a great many domains. For example, high-resolution maps help identify communities at risk for natural and human-derived disasters. Furthermore, identifying new building construction in satellite imagery is an important factor in establishing population estimates in many areas (e.g., [7]). Population estimates are also essential for assessing burden on infrastructure, from roads [4] to medical facilities [26]. The inclusion of unique building identifiers in the MUDS dataset enables potential improvements upon existing coarse population estimates. Without unique identifiers building tracking is not possible; this means that over a given area one can only determine how many new buildings exist. By tracking unique building identifiers one can determine which buildings changed (whose properties such as precise location, area, etc. can be correlated with features such as road access, distance to hospitals, etc.), thus providing a much more granular view into population growth.

Several unusual features of satellite imagery (e.g., small object size, high object density, dramatic image-to-image difference compared to frame-to-frame variation in video object tracking, different color band wavelengths and counts, limited texture information, drastic changes in shadows, and repeating patterns) are relevant to other tasks and data. For example, pathology slide images or other microscopy data present many of the same challenges [38]. Lessons learned from this dataset may therefore have broad-reaching relevance to the computer vision community.

† This work was completed prior to Nicholas Weir and Ryan Lewis joining Amazon.
2. Related Work
Past time series computer vision datasets and algorithmic advances have prepared the field to address many of the problems associated with satellite imagery analysis, allowing our dataset to explore additional computer vision problems. The challenge built around the VOT dataset [15] saw impressive results for video object tracking (e.g., [36]), yet this dataset differs greatly from satellite imagery, with high frame rates and a single object per frame. Other datasets such as MOT17 [17] or MOT20 [6] have multiple targets of interest, but still have relatively few objects per frame. The Stanford Drone Dataset [23] appears similar at first glance, but has several fundamental differences that result in very different applications. That dataset contains overhead videos taken at multiple hertz from a low elevation, which typically have tens of mobile objects (cars, people, buses, bicyclists, etc.) per frame. Because of the high frame rate of these datasets, frame-to-frame variation is minimal (see the MOT17 example in Figure 1D). Furthermore, objects are larger and less abundant in these datasets than buildings are in satellite imagery. As a result, video competitions and the models derived therein provide limited insight into how to manage imagery time series with substantial image-to-image variation and overly-dense instance annotations of target objects. Our data and research will address this gap.

To our knowledge, no existing dataset has offered a deep time series of satellite imagery. A number of previous works have studied building extraction from satellite imagery ([8], [5], [39], [27]), yet these datasets were static. The closest comparison is the xView2 challenge and dataset [10], which examined building damage in satellite image pairs acquired before and after natural disasters (i.e., only two timestamps) in a limited set of locations; however, this task fails to address the complexities and opportunities posed by analysis of deep time series data such as seasonal vegetation and lighting changes, or consistent object tracking on a global scale. Other competitions have explored time series data in the form of natural scene video, e.g., object detection [6] and segmentation [2] tasks. There are several meaningful dissimilarities between these challenges and the task described here. Firstly, frame-to-frame variation is very small in video datasets (see Figure 1D). By contrast, the appearance of satellite images can change dramatically from month to month due to differences in weather, illumination, and seasonal effects on the ground, as shown in Figure 1C. Other time series competitions have used non-imagery data spaced regularly over longer time intervals [9], but none focused on computer vision tasks.

The size and density of target objects are very different in this dataset than in past computer vision challenges. When comparing to the size of annotated instances in the COCO dataset [18], there is a clear difference in object size distributions (see Figure 1A). These smaller objects intrinsically provide less information as they comprise fewer pixels, making their identification a more difficult task. Finally, the number of instances per image is markedly different in satellite imagery from the average natural scene dataset (see Section 3 and Figure 1B). Other data science competitions have explored datasets with similar object size and density, particularly in the microscopy domain [21, 11]; however, those competitions did not address time series applications. Taken together, these differences highlight substantial novelty for this dataset.
3. Data
The Multi-Temporal Urban Development SpaceNet (MUDS) dataset consists of 101 labelled sequences of satellite imagery collected by Planet Labs' Dove constellation between 2017 and 2020, coupled with building footprint labels for every image. The image sequences are sampled at the 101 distinct areas of interest (AOIs) across the globe, covering six continents (Figure 2). These locations were selected to be both geographically diverse and to display dramatic changes in urbanization across a two-year timespan.

The MUDS dataset is open sourced under a CC-BY-4.0 ShareAlike International license‡ to encourage broad use. This dataset can potentially be useful for many other geospatial computer vision tasks: it can be easily fused or augmented with any other data layers that are available through web tile servers. The labels themselves can also be applied to any other remote sensing image tiles, such as high resolution optical or synthetic aperture radar.

Images are sourced from Planet's global monthly basemaps, an archive of on-nadir imagery containing visual RGB bands with a ground sample distance (GSD) (i.e., pixel size) of ≈ 4 meters. A basemap is a reduction of all individual satellite captures (also called scenes) into a spatial grid. These basemaps are created by mosaicing the best scenes over a calendar month, selected according to quality metrics like image sharpness and cloud coverage. Scenes are stack-ranked with best on top, and spatially harmonized to smooth scene boundary discontinuities. Monthly basemaps are particularly well suited for the computer vision analysis of urban growth, since they are relatively cloud-free, homogeneous, and represented in a consistent spatio-temporal grid. The monthly cadence is also a good match to the typical timescale of urban developments.

‡ https://registry.opendata.aws/spacenet/

Figure 1: A comparison between our dataset and related datasets. A. Annotated objects are very small in this dataset. Plot represents normalized histograms of object size in pixels. Blue is our dataset, red represents all annotations in the COCO 2017 training dataset [18]. B. The density of annotations is very high in our dataset. In each 1024 × 1024 image, our dataset has between 10 and over 20,000 objects (mean: 4,600). By contrast, the COCO 2017 training dataset has at most 50 objects per image. C. Three sequential time points from one geography in our dataset, spanning 3 months of development. Compare to D., which displays three sequential frames in the MOT17 video dataset [17].

Figure 2: Location of MUDS data cubes.

The size of each image is 1024 × 1024 pixels, corresponding to ≈ 18 km², and the total area of the images in the dataset is ≈ 41,250 km². See Table 1 or spacenet.ai for additional statistics. The time series contain imagery of 18−26 months, depending on AOI (median of 24). This lengthy time span captures multiple seasons and atmospheric conditions, as well as the commencement and completion of multiple construction projects. See Figure 3 for examples. Images containing an excessive amount of clouds or haze were fully excluded from the dataset, thus causing minor temporal gaps in some of the time series.

Each image in the dataset is accompanied by two sets of manually created annotations. The first set of labels are building footprint polygons defining the outline of each building. Each building is assigned a unique identifier (i.e., address) that persists throughout the time series. The second set of annotations are "unusable data masks" (UDMs) denoting areas of images that are obscured by clouds (see Figure 4) or that suffer from image geo-reference errors greater than 1 pixel. Geo-referencing is the process of mapping pixels in sensor space to geographic coordinates, performed via an empirical fitting procedure that is never exact. In rare cases, the scenes that compose the basemaps have spatial offsets of 5-10 meters. Accounting for such spatial displacements in the time series would make the modeling task significantly harder. Therefore, we decided to eliminate this complexity by including these regions in the UDM.

Each image has between 10 and ≈ 20,000 building annotations, with a mean of ≈ 4,600 (the earliest timepoints in some geographies have very few buildings completed). This represents much higher label density than natural scene datasets like COCO [18] (Figure 1B), or even overhead drone video datasets [34]. As the dataset comprises ≈ 24 time points at 101 geographic areas, the final dataset includes >11M annotations, representing >500,000 unique buildings. (Compare the training data quantities shown for other datasets in Table 1.) The building areas vary between approximately 0.25 and 13,000 pixels (median building area of 193 m² or 12.1 pix²), markedly smaller than most labels in natural scene imagery datasets (Figure 1A).

Seasonal effects and weather (i.e., background variation) pervade our dataset given the low frame rate of ≈ 4 × 10⁻⁷ Hz (Figure 1C). This "background" change adds to the change detection task's difficulty. This frame-by-frame background variation is particularly unique and difficult to recreate via simulation or video re-sampling.
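The quoted frame rate follows directly from the monthly basemap cadence; a quick arithmetic sanity check:

```python
# One basemap per calendar month: roughly 30 days between frames.
seconds_per_month = 30 * 24 * 3600
frame_rate_hz = 1.0 / seconds_per_month
print(f"{frame_rate_hz:.1e} Hz")  # 3.9e-07 Hz, i.e. ~4e-7 Hz
```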
We define buildings as static man-made structures where an individual could take shelter, with no minimum footprint size. The uniqueness of the dataset presents distinct labeling challenges. First, small buildings can be under-resolved to the human eye in a given image, making them difficult to locate and discern from other non-building structures. Second, in locations undergoing building construction, it can be difficult to determine at what point in time the structure becomes a building per our definition. Third, variability in image quality, atmospheric conditions, shadows, and seasonal phenology can introduce additional confusion. Mitigating these complexities and minimizing label noise was of paramount importance, especially along the temporal dimension. Even though the dataset AOIs were selected to contain urban change, construction events are still highly imbalanced compared to the full spatio-temporal volume. Thus, temporal consistency was a fundamental area of focus in the labeling strategy. In cases of high uncertainty with a particular building candidate, annotators examined the full time series to gain temporal and contextual information about the precise location. For example, a shadow from a neighboring structure might be confused as a building, but this becomes evident when inspecting the full data cube. Temporal context can also help identify groups of objects. Some regions have structures that resemble buildings in a given image, but are highly variable in time. Objects that appear and disappear multiple times are unlikely to be buildings. Once one type of such ephemeral structures is identified as a confusion source, all other similar structures are also excluded (Figure 5). Labeling took 7 months by a team of 5; each data cube was annotated by one person, reviewed and corrected by another, with final validation by the team lead.

Annotators also used a privately-licensed high resolution imagery map to help discriminate uncertain cases. This high resolution map is useful to gain contextual information about the region and to guide the precise building outlines that are unclear from the dataset imagery alone. Once a building candidate was identified in the MUDS imagery, the high resolution map was used to confirm the building geometry. In other words, labels were not created on the high resolution imagery first.

Figure 3: Time series of two data cubes. Left column (e.g., 1a) denotes the start of the time series, the middle column (e.g., 1b) the approximate midpoint, and the right column (e.g., 1c) shows the final image. The top row displays imagery, while the bottom row illustrates the labeled building footprints.

Table 1: Comparison of Selected Time Series Datasets

Property              MUDS    VOT-ST2020 [14]   MOT20 [6]   Stanford Drone [23]   DAVIS 2017 [3]   YouTube-VOS [41]
Scenes                101     60                4           60                    90               4,453
Total Frames          2,389   19,945            8,931       522,497               6,208            ∼135 (mean) per scene
Ground Sample Dist.   4.0 m   n/a               n/a         ∼                     n/a              n/a
While the option of labeling on high resolution imagery might seem attractive, it poses labeling risks such as capturing buildings that are not visible at all in the MUDS imagery. In addition, the high resolution map is static and composed of imagery acquired over a long range of dates, thus making it difficult to perform temporal comparisons between this map and the dataset imagery.

Figure 4: Single image in a data cube. (a) Raw image with cloud cover. (b) Image with UDM overlaid. (c) Masked image with building labels overlaid. (d) Zoom showing the high fidelity of building labels.

Figure 5: Example of how temporal context can help with object identification. If the middle image were to be labeled in isolation, objects A and B could be annotated as buildings. However, taking into account the adjacent images, these objects exist only for one month and therefore are unlikely to be buildings. Object C is also unlikely to be a building, just by group association.

The procedure to annotate each time series can be summarized as follows:

1. Start with the first image in the series. Identify the location of all visible structures. If the building location and outline are clear, draw a polygon around it. Otherwise, overlay a high resolution optical map to help confirm the presence of the building and draw the outline. Assign a unique integer identifier to each building. In addition, identify any regions in the image with impaired ground visibility or defects and add their polygons to the UDM layer of this image.

2. Copy all the building labels onto the next image (not the UDM). Examine carefully all buildings in the new image, and edit the labels with any changes. Edits are only made when there is significant confidence that a building appeared or disappeared. If a new building appeared, assign a new unique identifier. Toggle through multiple images in the time series to ensure: (a) there is a true building change and (b) that it is applied to the correct time point. Also, create a UDM.

3. Repeat step 2 for the remaining time points.

Figure 6: Zoom in of one particularly dense region illustrating the very high fidelity of labels. (a) Raw image. (b) Footprint polygon labels. (c) Footprints overlaid on imagery.

This process attempts to enforce temporal consistency and reduce object confusion. While label noise is appreciable in small objects, the use of high resolution imagery to label results in labels of significantly higher fidelity than would be achievable from the Planet data alone, as illustrated in Figure 6. This "omniscient labeling" is one of the key features of the MUDS dataset. We will show in Section 5 that the baseline algorithm does a surprisingly good job of extracting high-resolution features from the medium-resolution imagery. In effect, the labels are encoding information that is not visible to humans in the imagery, which the baseline algorithm is able to capitalize upon.
4. Evaluation Metrics
To evaluate model performance on a time series of identifier-tagged footprints such as MUDS, we introduce a new evaluation metric: the SpaceNet Change and Object Tracking (SCOT) metric [13]. As discussed later, existing metrics have a number of shortcomings that are addressed by SCOT. The SCOT metric combines two terms: a tracking term and a change detection term. The tracking term evaluates how often a proposal correctly tracks the same buildings from month to month with consistent identifier numbers. In other words, it measures the model's ability to characterize what stays the same as time goes by. The change detection term evaluates how often a proposal correctly picks up on the construction of new buildings. In other words, it measures the model's ability to characterize what changes as time goes by.

For both terms, the calculation starts the same way: finding "matches" between ground truth building footprints and proposal building footprints for each month. A pair of footprints (one ground truth and one proposal) are eligible to be matched if their intersection over union (IOU) exceeds 0.25, and no footprint may be matched more than once. We select an IOU of 0.25 to mimic Equation 5 of ImageNet [25], which relaxes the IOU requirement for small objects. A set of matches is chosen that maximizes the number of matches. If there is more than one way to achieve that maximum, then as a tie-breaker the set with the largest sum of IOUs is used. This is an example of the unbalanced linear assignment problem in combinatorics.

If model performance were being evaluated for a single image (instead of a time series), a customary next step might be calculating an F1 score, where matches are considered true positives (tp) and unmatched ground truth and proposal footprints are considered false negatives (fn) and false positives (fp) respectively.
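The matching step above can be sketched in a few lines. The helper below is our own illustration, not part of the SCOT reference implementation [13]: it brute-forces the unbalanced assignment for clarity, whereas a practical implementation would use a solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def match_footprints(iou, iou_thresh=0.25):
    """Match ground-truth footprints (rows) to proposals (columns).

    Maximizes the number of admissible matches (IOU > iou_thresh),
    breaking ties by the largest total IOU, per the SCOT matching rule.
    Brute force keeps the sketch dependency-free; use a linear
    assignment solver for realistically sized footprint sets.
    """
    n_gt = len(iou)
    n_prop = len(iou[0]) if n_gt else 0
    # Each ground-truth row gets a proposal column or None (unmatched).
    slots = list(range(n_prop)) + [None] * n_gt
    best, best_key = [], (-1, -1.0)
    for assignment in set(permutations(slots, n_gt)):
        pairs = [(g, p) for g, p in enumerate(assignment)
                 if p is not None and iou[g][p] > iou_thresh]
        key = (len(pairs), sum(iou[g][p] for g, p in pairs))
        if key > best_key:
            best, best_key = pairs, key
    return sorted(best)
```

For example, with three ground-truth footprints and two proposals, `match_footprints([[0.9, 0.3], [0.0, 0.5], [0.0, 0.0]])` returns `[(0, 0), (1, 1)]`: two matches, and the third ground-truth footprint becomes a false negative.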
F1 = tp / (tp + (fp + fn)/2)    (1)

The tracking term and change detection term both generalize this to a time series, each in a different way.

The tracking term penalizes inconsistent identifiers across time steps. A match is considered a "mismatch" if the ground truth footprint's identifier was most recently matched to a different proposal ID, or vice versa. For the purpose of the tracking term, mismatches (mm) are not counted as true positives. So each mismatch decreases the number of true positives by one. This effectively divorces the ground truth footprint from its mismatched proposal footprint, creating an additional false negative and an additional false positive. That amounts to the following transformations:

tp → tp − mm
fp → fp + mm
fn → fn + mm    (2)

Applying these to the F1 expression above gives the formula for the tracking term:

F_track = (tp − mm) / (tp + (fp + fn)/2)    (3)

The second term in the SCOT metric, the change detection term, incorporates only new footprints. That is, ground truth or proposal footprints with identifier numbers making their first chronological appearance. Letting the subscript "new" indicate the count of tp's, fp's, and fn's that persist after dropping non-new footprints:

F_change = tp_new / (tp_new + (fp_new + fn_new)/2)    (4)

Figure 7: (a) Example of SCOT metric tracking term (here F_track = 0.593). Solid brown polygons are ground truth building footprints, and outlines are proposal footprints. Each footprint's corresponding identifier number is shown; the legend distinguishes matches that are not mismatches, mismatches, and unmatched proposals. (b) Example of SCOT metric change detection term (here F_change = 0.333), using the same set of ground truth and proposal footprints. This term ignores all ground truth and proposal footprints with previously-seen identifiers, which are indicated in a faded-out gray color.

One important property of this term is that a set of static proposals that do not vary from one month to another will receive a change detection term of 0, even for a time series with very little new construction. (In the MUDS dataset, the construction of new buildings is by far the most common change; the metric could be generalized to accommodate building demolition or other changes by any of several straightforward generalizations.)

To compute the final score, the two terms are combined with a weighted harmonic mean:

F_scot = (1 + β²) · F_change · F_track / (β² · F_change + F_track)    (5)

We use a value of β = 2 to emphasize the part of the task (tracking) that has been less commonly explored in an overhead imagery context. For a dataset like MUDS with multiple AOIs, the overall SCOT score is the arithmetic mean of the scores of the individual AOIs.

Figure 7a is a cartoon example of calculating the tracking term on a row of four buildings imaged over five months (during which time two of the four are newly-constructed, and two are temporarily occluded by clouds). Figure 7b illustrates the change detection term for the same case.

For geospatial work, the SCOT metric has a number of advantages over evaluation metrics developed for object tracking in video, such as the Multiple Object Tracking Accuracy (MOTA) metric [1].
MOTA scores are mathematically unbounded, making them less intuitively interpretable for challenging low-score scenarios, and sometimes even yielding negative scores. More critically, for scenes with only a small amount of new construction, it is possible to achieve a high MOTA score with a set of proposal footprints that shows no time-dependence whatsoever. Since understanding time-dependence is usually a primary purpose of time series data, this is a serious drawback. SCOT's change detection term prevents this. In fact, many such approaches to "gaming" the SCOT metric by artificially increasing one term will decrease the other term, leaving no obvious alternative to intuitively-better model performance as a way to raise scores.
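Putting Equations 1-5 together, the per-AOI score reduces to simple arithmetic on the accumulated counts. The sketch below uses our own (hypothetical) function names and takes the counts as given:

```python
def f1_score(tp, fp, fn):
    """Equation 1: F1 from matched/unmatched footprint counts."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0

def scot_score(tp, fp, fn, mm, tp_new, fp_new, fn_new, beta=2.0):
    """Equations 3-5: combine tracking and change terms for one AOI."""
    track_denom = tp + 0.5 * (fp + fn)
    f_track = (tp - mm) / track_denom if track_denom else 0.0
    f_change = f1_score(tp_new, fp_new, fn_new)
    denom = beta**2 * f_change + f_track
    return (1 + beta**2) * f_change * f_track / denom if denom else 0.0
```

With, say, tp = 20, fp = fn = 5, mm = 2 and tp_new = 3, fp_new = 1, fn_new = 2, the tracking term is 18/25 = 0.72 and the change term is 3/4.5 ≈ 0.67; the β = 2 weighting pulls the combined score toward the tracking term. Note that a fully static proposal set yields f_change = 0 and hence a SCOT score of 0, as described above.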
5. Experiments
For object tracking, one could in theory leverage the results of previous challenges (e.g., MOT20 [6]), yet the significant differences between MUDS and previous datasets, such as high density and small object size (see Figure 1), render previous approaches unsuitable. For example, approaches such as TrackR-CNN [35] are untrainable, as each instance requires a separate channel, resulting in a memory explosion for images with many thousands of objects. Other approaches such as Joint Detection and Embedding (JDE) [37] are trainable; however, inference results are ultimately incoherent due to the tiny object size and density overwhelming the YOLOv3 [22] detection grid. Despite these challenges, the spatially static nature of our objects of interest somewhat simplifies tracking objects between each observation. Consequently, this dataset should incentivize the development of new object tracking algorithms that can cope with a lack of resolution, spatial stasis, minimal size, and dense clustering of objects.

As a result of the challenges listed above, we choose to experiment with semantic segmentation based approaches to detect and track buildings over time. These methods are adapted from prize winning approaches for the SpaceNet 4 and 6 Building Footprint Extraction Challenges [40, 28]. Our architecture comprises a U-Net [24] with different encoders. The first "baseline" approach uses a VGG16 [30] encoder and a custom loss function of L = J + 4 · BCE, where J is Jaccard distance and BCE denotes binary cross entropy. The second approach uses a more advanced EfficientNet-B5 [32] encoder with a loss of L = F + D, where F is focal loss [19] and D is Dice loss.

To ensure robust testing statistics, we train the model on 60 data cubes, testing on the remaining 41 data cubes.
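The baseline loss L = J + 4·BCE can be written compactly. The snippet below is an illustrative NumPy version operating on probability masks (a training framework would use its own differentiable equivalents):

```python
import numpy as np

def jaccard_distance(pred, target, eps=1e-7):
    """Soft Jaccard distance J between a probability mask and a binary mask."""
    inter = float(np.sum(pred * target))
    union = float(np.sum(pred) + np.sum(target)) - inter
    return 1.0 - (inter + eps) / (union + eps)

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross entropy (BCE), with clipping for numerical safety."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def baseline_loss(pred, target):
    """The baseline training loss L = J + 4 * BCE described above."""
    return jaccard_distance(pred, target) + 4.0 * binary_cross_entropy(pred, target)
```

A perfect prediction drives both terms toward zero, while the 4× weight on BCE dominates early in training when per-pixel probabilities are far from the mask.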
We train the segmentation models with an Adam optimizer on the 1424 images of the training set for 300 epochs (baseline) or 100 epochs (EfficientNet).

At inference time, binary building prediction masks are converted to instance segmentations of building footprints. Each footprint at t = 0 is assigned a unique identifier. For each subsequent time step, building footprint polygons are compared to the positions of the previous time step. Building identifier matching is achieved by an optimized matching of polygons with a minimum IOU overlap of 0.25. Matched footprints are assigned the same identifier as the previous timestep, while footprints without significant overlap with preceding geometries are assigned a new unique identifier. The baseline algorithm is illustrated in Figure 8; note that building identifiers are well matched between epochs. Performance is summarized in Table 2. For scoring we assess only buildings above a minimum area threshold.

Figure 8: Baseline algorithm for building footprint extraction and identifier tracking showing evolution from T = 0 (top row) to T = 5 (bottom row). The input image is fed into our segmentation model, yielding a building mask (second column). This mask is refined into building footprints (third column), and unique identifiers are allocated (right column).

Figure 9: Example tracking performance of the baseline algorithm. Note that larger, well-separated buildings are tracked well between epochs, while denser regions are more challenging for tracking.

Table 2: Building Tracking Performance (F1 at IOU ≥ 0.25, tracking score, change score, and SCOT for the VGG-16 and EfficientNet models; the baseline VGG-16 achieves F1 = 0.45 ± 0.13, with no significant difference between the two architectures).

Localizing and tracking buildings in medium resolution (≈ 4 m) imagery is quite challenging, but surprisingly achievable in our experiments. For well separated buildings, building localization and tracking perform fairly well; for example in Figure 9 we find a localization F1 score of 0.55, and a SCOT score of 0.31. For dense regions, building tracking is far more difficult; in Figure 10 we still see decent performance in building localization (F1 = 0.40), yet building tracking and change detection is very challenging (SCOT = 0.07) since inter-epoch footprints overlap poorly. The change term of SCOT is particularly challenging, as correctly identifying the origin epoch of each building is non-trivial, and spurious proposals are also penalized.

Figure 10: Prediction in a difficult, crowded region. Despite the inherent difficulties in separating nearby buildings at medium resolution, for this image F1 = 0.40.

In an attempt to raise the scores of Table 2, we also endeavor to incorporate the time dimension into training. As previously mentioned, existing approaches transfer poorly to this dataset, so we attempt a simple approach of stacking multiple images at training time. For each date we train on the imagery for that date plus the four chronologically adjacent future observations [t, t+1, t+2, t+3, t+4] for five total dates of imagery. When the number of remaining observations in the time series becomes less than five, we repeatedly append the final image for each area of interest. We find no improvement with this approach.

We also note no significant difference in scores between the VGG-16 and EfficientNet architectures (Table 2), implying that older architectures are essentially as adept as state-of-the-art architectures when it comes to extracting information from the small objects in this dataset.

While not fully explored here, we also anticipate that researchers may improve upon the baseline using models specifically intended for time series analysis (e.g., Recurrent Neural Networks (RNNs) [20] and Long-Short Term Memory networks (LSTMs) [12]). In addition, numerous "classical" geospatial time series methods exist (e.g., [42]) which researchers may find valuable to incorporate into their analysis pipelines as well.
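The identifier bookkeeping described above can be sketched in a few lines; `propagate_ids` and its argument names are our own illustrative choices, with the match list assumed to come from the IOU ≥ 0.25 polygon matching step:

```python
from itertools import count

def propagate_ids(prev_ids, matches, n_current, fresh_ids):
    """Carry building identifiers forward one time step.

    prev_ids:  identifiers of the previous frame's footprints.
    matches:   (prev_index, cur_index) pairs from IOU-based matching.
    n_current: number of footprints detected in the current frame.
    fresh_ids: iterator yielding unused identifiers for new buildings.
    """
    cur_ids = [None] * n_current
    for prev_idx, cur_idx in matches:
        cur_ids[cur_idx] = prev_ids[prev_idx]   # same building, same id
    for i in range(n_current):
        if cur_ids[i] is None:                  # unmatched: new construction
            cur_ids[i] = next(fresh_ids)
    return cur_ids

# Footprints 0 and 2 overlap buildings 7 and 9 from the prior month;
# footprint 1 has no predecessor, so it receives a fresh identifier.
ids = propagate_ids([7, 8, 9], [(0, 0), (2, 2)], 3, count(100))
```

Here `ids` comes back as `[7, 100, 9]`: two identifiers carried forward and one newly allocated, mirroring the spatially static nature of buildings that makes this simple overlap rule workable.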
6. Discussion
Intriguingly, the score of F1 ≈ 0.45 for our baseline model parallels previous results observed in overhead imagery. [29] studied object detection performance in xView [16] satellite imagery for various resolutions and five different object classes. These authors used the YOLT [33] object detection framework, which uses a custom network based on the GoogLeNet [31] architecture. The mean extent of the objects in that work was 5.3 meters; at a resolution of 1.2 meters, objects therefore have an average extent of 4.4 pixels.

The average building area for the MUDS dataset is 332 m², implying an extent of 18.2 m for a square object. For a 4 meter resolution, this gives an average extent of 4.5 pixels, comparable to the 4.4 pixel extent of xView. The observed MUDS F1 score of 0.45 is within error bars of the xView results; see Table 3. Of particular note is that while the F1 scores and object pixel sizes of Table 3 are comparable, the datasets stem from vastly different sensors, and the techniques are wildly different as well (a GoogLeNet-based object detection architecture versus a VGG16-based segmentation architecture). Apparently, object detection performance holds across sensors and algorithms as long as object pixel sizes are comparable.

Table 3: F1 Performance Across Datasets (xView: 1.2 m GSD, 4.4 pixel mean object size; MUDS: 4.0 m GSD, 4.5 pixel mean object size, F1 = 0.45 ± 0.13; the two F1 scores agree within error bars).
7. Conclusions
The Multi-Temporal Urban Development SpaceNet (MUDS, also known as SpaceNet 7) dataset is a newly developed corpus of imagery and precise labels designed for tracking building footprints and unique identifiers. The dataset covers over 100 locations across 6 continents, with a deep temporal stack of 24 monthly images and over 11,000,000 labeled objects. The significant scene-to-scene variation of the monthly images poses a challenge for computer vision algorithms, but also raises the prospect of developing algorithms that are robust to seasonal change and atmospheric conditions. One of the key characteristics of the MUDS dataset is exhaustive "omniscient labeling," with label precision far exceeding the base imagery resolution of 4 meters. Such dense labels present significant challenges in crowded urban environments, though we demonstrate surprisingly good building extraction, tracking, and change detection performance with our baseline algorithm. Intriguingly, our object detection performance of F1 = 0.45 for objects averaging 4-5 pixels in extent is consistent with previous object detection studies, even though those studies used far different algorithmic techniques and datasets. There are numerous avenues of research beyond the scope of this paper that we hope the community will tackle with this dataset: the efficacy of super-resolution, adapting video time-series techniques to the unique features of MUDS, and experimenting with RNNs, Siamese networks, LSTMs, etc. Furthermore, the dataset has the potential to aid a number of humanitarian efforts connected with population dynamics and UN sustainable development goals.

References

[1] Keni Bernardin, Alexander Elbs, and Rainer Stiefelhagen. Multiple object tracking performance metrics and evaluation in a smart room environment. Sixth IEEE International Workshop on Visual Surveillance, in conjunction with ECCV, 2006.
[2] Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool.
The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation. arXiv:1905.00737, 2019.
[3] Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation. arXiv:1905.00737, 2019.
[4] Simiao Chen, Michael Kuhn, Klaus Prettner, and David E. Bloom. The global macroeconomic burden of road injuries: estimates and projections for 166 countries. 2019.
[5] Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. DeepGlobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
[6] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. MOT20: A benchmark for multi object tracking in crowded scenes, 2020.
[7] R. Engstrom, D. Newhouse, and V. Soundararajan. Estimating small area population density using survey data and satellite imagery: An application to Sri Lanka. Urban Economics & Regional Studies eJournal, 2019.
[8] Adam Van Etten, Dave Lindenbaum, and Todd M. Bacastow. SpaceNet: A remote sensing dataset and challenge series. CoRR, abs/1807.01232, 2018.
[9] Google. Web traffic time series forecasting: Forecast future traffic to Wikipedia pages.
[10] Ritwik Gupta, Richard Hosfelt, Sandra Sajeev, Nirav Patel, Bryce Goodman, Jigar Doshi, Eric Heim, Howie Choset, and Matthew Gaston. Creating xBD: A dataset for assessing building damage from satellite imagery. In Proceedings of the 2019 CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[11] Booz Allen Hamilton and Kaggle. Data science bowl 2018: Spot nuclei. Speed cures.
[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9, 1997.
[13] Daniel Hogan and Adam Van Etten. The SpaceNet change and object tracking (SCOT) metric, August 2020.
[14] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kamarainen, Luka Čehovin Zajc, Martin Danelljan, Alan Lukezic, Ondrej Drbohlav, Linbo He, Yushan Zhang, Song Yan, Jinyu Yang, Gustavo Fernandez, et al. The eighth visual object tracking VOT2020 challenge results, 2020.
[15] Matej Kristan, Jiri Matas, Aleš Leonardis, Tomas Vojir, Roman Pflugfelder, Gustavo Fernandez, Georg Nebehay, Fatih Porikli, and Luka Čehovin. A novel performance evaluation methodology for single-target trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2137–2155, Nov 2016.
[16] Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. xView: Objects in context in overhead imagery. CoRR, abs/1802.07856, 2018.
[17] Laura Leal-Taixé, Anton Milan, Konrad Schindler, Daniel Cremers, Ian D. Reid, and Stefan Roth. Tracking the trackers: An analysis of the state of the art in multiple object tracking. CoRR, abs/1704.02781, 2017.
[18] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[19] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[20] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), 2010.
[21] Recursion Pharmaceuticals. CellSignal: Disentangling biological signal from experimental noise in cellular images.
[22] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[23] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, 2016.
[24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 2015 International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
[26] Nadine Schuurman, Robert S. Fiedler, Stefan C.W. Grzybowski, and Darrin Grund. Defining rational hospital catchments for non-urban areas based on travel time. 5, 2006.
[27] Jacob Shermeyer, Daniel Hogan, Jason Brown, Adam Van Etten, Nicholas Weir, Fabio Pacifici, Ronny Hansch, Alexei Bastidas, Scott Soenen, Todd Bacastow, and Ryan Lewis. SpaceNet 6: Multi-sensor all weather mapping dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
[28] Jacob Shermeyer, Daniel Hogan, Jason Brown, Adam Van Etten, Nicholas Weir, Fabio Pacifici, Ronny Hansch, Alexei Bastidas, Scott Soenen, Todd Bacastow, and Ryan Lewis. SpaceNet 6: Multi-sensor all weather mapping dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
[29] Jacob Shermeyer and Adam Van Etten. The effects of super-resolution on object detection performance in satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 2015 International Conference on Learning Representations, 2015.
[31] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. Pages 1–9, 2015.
[32] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks, 2020.
[33] A. Van Etten. Satellite imagery multiscale rapid detection with windowed networks. Pages 735–743, 2019.
[34] Stanford Computational Vision and Geometry Lab. Stanford drone dataset.
[35] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-object tracking and segmentation, 2019.
[36] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip H.S. Torr. Fast online object tracking and segmentation: A unifying approach. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[37] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. Towards real-time multi-object tracking, 2020.
[38] Nicholas Weir, JJ Ben-Joseph, and Dylan George. Viewing the world through a straw: How lessons from computer vision applications in geo will impact bio image analysis, Jan 2020.
[39] Nicholas Weir, David Lindenbaum, Alexei Bastidas, Adam Van Etten, Sean McPherson, Jacob Shermeyer, Varun Kumar, and Hanlin Tang. SpaceNet MVOI: A multi-view overhead imagery dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[40] Nicholas Weir, David Lindenbaum, Alexei Bastidas, Adam Van Etten, Sean McPherson, Jacob Shermeyer, Varun Kumar Vijay, and Hanlin Tang. SpaceNet MVOI: A multi-view overhead imagery dataset. In Proceedings of the 2019 International Conference on Computer Vision, volume abs/1903.12239, 2019.
[41] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. YouTube-VOS: A large-scale video object segmentation benchmark, 2018.
[42] Zhe Zhu. Change detection using Landsat time series: A review of frequencies, preprocessing, algorithms, and applications.