Overhead MNIST: A Benchmark Satellite Dataset
David A. Noever and Samantha E. Miller Noever
PeopleTec, Inc., Huntsville, Alabama, USA [email protected]
ABSTRACT
The research presents an overhead view of 10 important objects and follows the general formatting requirements of the most popular machine learning task: digit recognition with MNIST. This dataset offers a public benchmark extracted from over a million human-labelled and curated examples. The work outlines the key multi-class object identification task while matching with prior work in handwriting, cancer detection and retail datasets. A prototype deep learning approach with transfer learning and convolutional neural networks (MobileNetV2) correctly identifies the ten overhead classes with average accuracy of 96.7%. This model exceeds the peak human performance of 93.9%. For upgrading satellite imagery and object recognition, this new dataset benefits diverse endeavors such as disaster relief, land use management and other traditional remote sensing tasks. The work extends satellite benchmarks with new capabilities to identify efficient and compact algorithms that might work on-board small satellites, a practical task for future multi-sensor constellations. The dataset is available on Kaggle and GitHub.
KEYWORDS
Neural Networks, Computer Vision, Image Classification, Satellite Imagery, MNIST Benchmark
INTRODUCTION
The most popular starting test for both new and established machine learning algorithms relies on handwritten digit [1] or letter [2] recognition. If a method does not work with the Modified National Institute of Standards and Technology dataset (MNIST), it most likely will not work on more challenging tasks. As illustrated in Figure 1, the core task is a multi-class image challenge, one that proves common and useful in fields [3] other than handwriting interpretation. Over the last two decades [4], researchers have produced more than 48,000 MNIST-related publications, a quarter of them appearing in 2020. The downside of this universality stems from the relative ease with which modern algorithms solve the problem (>99% accuracy after a few iterations) [5-6]. Critics of the digit dataset {0-9} note that it contains near-duplicates and lacks diversity in its examples, such that modifying a single pixel (among the 784 pixels in a 28x28 image) can cause some algorithms to misidentify the expected digit [7-9]. Practical extensions of the digit recognition task now include alphabetic handwriting in multiple languages (e.g. English [2], Chinese [10], Russian [11], Kannada [12], American Sign Language [13]) and related everyday object recognition, the most popular of which include 10 categories of skin cancer [14] and clothing [15] in thumbnail grayscale images (HAM10000 [14] and Fashion-MNIST [15]).
The present work provides another challenging object recognition task: labelling objects from overhead satellite imagery (Figure 2). To take advantage of the vast machine learning literature on digit recognition, we closely mirror the format of the original MNIST [1] and thus, like Fashion-MNIST [15], we aim to provide the research community with another drop-in replacement for benchmarking [16]. The grayscale (28x28 pixel) imagery provides a challenging object recognition task [17].
Figure 1. Examples of MNIST families for digits, letters, signing, and objects.
As viewed from above, objects such as planes, ships, and stadiums offer no obvious preferred orientation, so rotating or image shearing may not augment dataset diversity [18]. Overhead-object classifiers also suffer from scale variations that can range over more than two orders of magnitude, from a small car to a stadium [19-21]. Compared to alternative terrestrial (color) datasets (like the CIFAR-10 thumbnails [22]), a classifier of satellite imagery may span different camera resolutions, orientations, and day-night contrast levels in different seasons and shadows. The research offers a novel benchmark [23] for recognizing objects in thumbnail satellite images, a reformatting strategy to connect with the vast MNIST algorithm literature and, finally, an initial solution to the classification problem using transfer learning and MobileNetV2 [24]. The original contributions of the present work include: 1) generalizing the standard MNIST format [1,16] and dataset design to handle challenging satellite object detection (called Overhead-MNIST); 2) exploring the unique aspects of overhead object recognition such as diverse object scale lengths and rotational invariance [17,21]; 3) classifying 10 classes of objects with multiple algorithms, including some efficient enough to run on-board satellites for automated tasking and cueing. Lacking the 70,000 thumbnails of the original digit recognition dataset [1], we examine the requirements for solving the problem as a function of sample size.
Figure 2. Ten Classes Labelled for Overhead MNIST.
The class selection includes dynamic (car, plane, helicopter, ship) and static (parking lot, runway mark, harbor) targets along with infrastructure-related objects (storage tanks, stadium, oil gas fields).
METHODS
We curated the overhead imagery from multiple public sources: xView [17], UC Merced Land Use [25], DOTA [26], and SpaceNet [27]. The combined source dataset consists of 102 labelled classes after extraction from any bounding boxes. We resized all the objects to 28x28 pixels in grayscale using ImageMagick [28] command line tools. We approximately balanced (within 20%) the data between the 10 classes with 1000 examples per category. Helicopters are under-represented (800) and harbors are over-represented (1200). For these 10,000 training and testing images (90:10 ratio), the original data was pre-processed as labelled objects in 10 object classes (car, harbor, helicopter, oil gas field, parking lot, plane, runway mark, ship, stadium, and storage tank). While other common classes (like buildings) provide a quick test for urban landscapes, their diversity of scale and overlapping density discouraged their inclusion. Three different formats are provided for download: comma-separated values (CSV) files, JPEG images sorted by class (10 total), and the original MNIST binary format (idx-ubyte) [1]. These three formatting options should cover almost all published MNIST solutions with only minor modifications [16]. To mimic the 10-digit recognizer classes, the CSV files for both testing and training have labels in alphabetic order (0-9, where 0=car, 1=harbor, etc.). To make a drop-in replacement and establish contact with the existing MNIST datasets, we further process these satellite images into just 10 classes and generate CSV pixel values by converting the gray JPEG to text [28], then parsing out the pixel values in 784 columns for each image as a single labelled row. The selection of these classes is motivated by their frequent representation among the raw source imagery and their diversity in size, shape and background.
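The CSV and idx-ubyte layouts described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' conversion script: it assumes the label occupies the first CSV column followed by the 784 pixel values in row-major order, and the function names, the underscore-separated class strings, and the synthetic gradient image are all hypothetical. The idx magic numbers (2051 for images, 2049 for labels) follow the original MNIST binary layout.

```python
import struct

SIDE = 28  # thumbnail edge length in pixels

# Alphabetical class order as described in the text (0=car, 1=harbor, ...).
# The exact strings are illustrative, not taken from the released files.
CLASSES = ["car", "harbor", "helicopter", "oil_gas_field", "parking_lot",
           "plane", "runway_mark", "ship", "stadium", "storage_tank"]


def to_csv_row(label, image):
    """Flatten a 28x28 grid of 0-255 ints into 'label,p0,...,p783'.

    Assumption: label first, pixels in row-major order."""
    pixels = [p for row in image for p in row]
    if len(pixels) != SIDE * SIDE:
        raise ValueError("expected a 28x28 image")
    return ",".join(str(v) for v in [label] + pixels)


def from_csv_row(line):
    """Inverse of to_csv_row: return (label, 28x28 grid)."""
    values = [int(v) for v in line.strip().split(",")]
    label, pixels = values[0], values[1:]
    image = [pixels[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return label, image


def to_idx_images(images):
    """Pack images as idx3-ubyte: magic 2051, count, rows, cols, raw bytes."""
    header = struct.pack(">IIII", 2051, len(images), SIDE, SIDE)
    body = bytes(p for img in images for row in img for p in row)
    return header + body


def to_idx_labels(labels):
    """Pack labels as idx1-ubyte: magic 2049, count, one byte per label."""
    return struct.pack(">II", 2049, len(labels)) + bytes(labels)


# Synthetic gradient standing in for a real thumbnail.
demo = [[(r * SIDE + c) % 256 for c in range(SIDE)] for r in range(SIDE)]
row = to_csv_row(CLASSES.index("plane"), demo)
```

Because each CSV row carries one label plus 784 pixel columns, most loaders written for the digit CSVs should only need the class-name lookup changed.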
To examine objects with different natural scales and distinct features, these criteria exclude some common object classes in xView [17], such as buildings and trucks. The class selection includes a relative balance of dynamic (car, plane, helicopter, ship) and static (parking lot, runway mark, harbor) targets along with infrastructure-related objects (storage tanks, stadium, oil gas fields). This class ontology differs from a standard handwriting digit recognizer (0-9) in diversity [7] and more closely resembles the Fashion-MNIST ontology [15] with shirts, pants, etc. It is worth noting that for these small grayscale thumbnails at 28x28 pixels, a human analyst might be challenged to distinguish the class differences, particularly given original formats that might include 16 million pixels in the full satellite scan. In the same way that the CIFAR-10 dataset [22] provides ambiguous choices at the image size of a thumbnail, this scale also matches what one might encounter in recognizing space objects. For comparison, the 28x28 pixel image of a car in Figure 2 would offer the human analyst approximately 300 pixels to identify a 3x5 square meter object. The base ground sample distance for the original xView [20] and SpaceNet [27] imagery corresponds to WorldView-3 satellite images [21] with a resolution of approximately 0.3-0.7 m per pixel.
RESULTS
As shown in Table 1, a lightweight deep learning model (MobileNetV2) trained on the Overhead-MNIST dataset reaches 90-100% accuracy on previously unseen test samples, with storage tanks scoring lowest (90%) and planes highest (99%). The confusion matrix in Figure 3 shows the correspondence between actual and predicted class assignments in the test set. The current state of the art [6] is 99.84% accuracy for standard digit recognition on MNIST (using capsule neuronal layers), compared to the 96.4% accuracy found here for Overhead-MNIST (using standard convolutional neural networks). We will compare these results across multiple algorithms and feature extraction techniques in a companion benchmarking publication. To understand the role of sample number and human comparison, we revisited Fashion-MNIST [15] with its 70,000 images and the same framework for transfer learning (MobileNetV2 [24]).
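To make the link between a confusion matrix and per-class scores concrete, the sketch below derives both from a matrix of counts. It assumes rows hold the actual class and columns the predicted class, and it treats "accuracy by class" as the fraction of each actual class predicted correctly; the paper does not state its exact definition, so that reading is an assumption. The three-class counts are invented for illustration and are not the paper's results.

```python
def per_class_accuracy(confusion):
    """Fraction of each actual class (row) that was predicted correctly."""
    scores = []
    for i, row in enumerate(confusion):
        total = sum(row)
        scores.append(row[i] / total if total else 0.0)
    return scores


def overall_accuracy(confusion):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Invented toy counts: rows = actual class, columns = predicted class.
toy = [[90, 5, 5],
       [1, 99, 0],
       [4, 0, 96]]

class_scores = per_class_accuracy(toy)
overall = overall_accuracy(toy)
```

With unbalanced classes (for example 800 helicopters against 1200 harbors), the overall figure weights classes by their counts, so it can differ from the plain mean of the per-class scores.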
Table 1. Results for MobileNetV2 in transfer learning on Overhead-MNIST: test accuracy by class (oil gas field, harbor, parking lot, ship, car, stadium, helicopter, runway mark, storage tank).
Figure 3. Confusion Matrix using MobileNetV2 and Overhead MNIST. With a small and fast classifier, the overall average accuracy reaches 96.4% across all object types.
Since humans originally labelled the training and testing data, how could a machine learning algorithm ever exceed human performance? In practice, humans get tired. Not all humans have honed a skill for image analysis or the rapid visual identification of objects, particularly when scoring thousands of tiny thumbnails. Test subjects disagree with the people who labelled the training set. For instance, when compared to a relentlessly attentive machine classifier, a human would have to be 100% right on digit recognition for a thin victory of a few tenths of a percent. The more challenging Fashion-MNIST [15] offers a baseline 89.7% accuracy using a statistical technique (Support Vector Classifier, SVC) and greater than 97% using convolutional nets [29]. The comparable human benchmark for 1000 randomly sampled test images was 83.5% accuracy among crowd-sourced classifiers (without any fashion expertise) [29]. One might argue that the "no previous experience" caveat limits the human-machine comparison. To supplement this prior work, we worked with one trained volunteer (graduate design degree and fashion experience), who scored 1000 images in a test similar to the published Fashion-MNIST benchmark for human performance. The labeller achieved nearly 100% accuracy in the less ambiguous categories (bag, sneaker) but underperformed the originally labelled data where overlap arises (shirt, coat, T-shirt). Two repetitions of the experiment are shown in Figure 4, including a new benchmark human accuracy of 85.12%, which compares well with the 89.7% accuracy of a statistical technique based on feature extraction and image pre-processing. With the same human labeller serving as the second opinion on Overhead-MNIST, Figure 5 summarizes the categorical correctness at 93.86%, with the greatest human ambiguity for parking lots and helicopters. It is worth noting that helicopters, and to a lesser extent ships, naturally include distortion when transformed from their original rectangular bounding boxes into the square thumbnails for the MNIST format (Figure 2).
As most practitioners examine overall dataset robustness, accuracies that reach 100% point to over-training, lack of diversity, or insufficient size [7]. Overhead-MNIST offers 9584 labelled thumbnails, approximately an order of magnitude fewer training instances than the original MNIST. The non-digit MNIST variants in Chinese [10] and Kannada [12], by comparison, currently label 15,000 images in total across more classes.
Figure 4. Human performance for Fashion-MNIST image classification. The results highlight variability between ambiguous class identifications that prove hard to decipher from thumbnails in grayscale.
Figure 5. Human performance for Overhead-MNIST image classification. The results highlight variability between ambiguous class identifications that prove hard to decipher from thumbnails in grayscale.
Figure 6. Effects of sample size on Fashion-MNIST accuracy. MobileNetV2 with transfer learning shows a drop in accuracy starting from 150 examples per class, but losses plateau at 11 to 30 examples.
To investigate the differences between these other MNIST datasets (if greater than 70,000 images), we ran additional exploratory experiments on Fashion-MNIST. By shrinking the training data, one can see whether the overall accuracy declines once the set reaches the 10,000 examples offered by Overhead-MNIST. Most observers hypothesize that current image classifiers succeed in the range of 100-1000 examples per class, but fewer examples can still work for either transfer learning (10-100 examples) [30] or novel one-shot or few-shot architectures (<20 examples) [31]. Figure 6 summarizes the test results when the training data was varied between 10,000 and 1000 total examples, or approximately 1000-100 per class. The top result for the full 70,000 images (10,000 test) exceeds 97%. For this style of diverse thumbnail dataset, one rough figure of merit is therefore that each order-of-magnitude increase in training size raises the overall accuracy by around 10% between 1000 and 70,000 total cases, until some dataset-dependent plateau is reached. We created a larger sample of 76,692 augmented satellite images by applying rotational and brightening transformations. One might hope that, to reach superhuman performance [5], an optimal number of images corresponds to a realizable cross-over or goal, particularly given the shortage of rarer or uncollected overhead samples.
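The shrinking-training-set experiment and the rough figure of merit can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, the balanced label list is synthetic, and the 10-points-per-decade rule simply encodes the approximate trend quoted in the text rather than a fitted model.

```python
import math
import random
from collections import defaultdict


def subsample_per_class(labels, per_class, seed=0):
    """Return sorted indices holding at most `per_class` examples per label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in sorted(by_label):
        pool = by_label[label]
        chosen.extend(rng.sample(pool, min(per_class, len(pool))))
    return sorted(chosen)


def rough_accuracy_gain(n_small, n_large, points_per_decade=10.0):
    """Rule-of-thumb gain in accuracy points between two training sizes.

    Assumption: about 10 points per order of magnitude, as quoted in the
    text, valid only before a dataset-dependent plateau."""
    return points_per_decade * math.log10(n_large / n_small)


# Synthetic, perfectly balanced labels for 10 classes.
labels = [i % 10 for i in range(10000)]
sizes = {n: len(subsample_per_class(labels, n)) for n in (1000, 100, 10)}
gain = rough_accuracy_gain(1000, 10000)
```

Sampling within each class keeps the balance fixed while the total shrinks, so any drop in accuracy can be attributed to sample size rather than to a shifted class mix.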
Satellite overhead imagery differs from many standard image classification challenges [18-19]. Overhead imagery offers no obvious up and down, so augmentation [32] with rotation can offer diverse testing images that would not make sense for other datasets like fashion, digits, or alphabets. Similarly, the diversity of resolutions [21] arising from combined datasets and different satellites represents a noise and resolution variation that more standardized benchmarks can avoid in their domains. For leveraging previous MNIST work, the major challenge in building a representative satellite dataset follows from a shortage of labelled data. The largest crowd-sourced implementation (xView) labelled a million objects with bounding boxes [20] but offered an unbalanced count that mirrors the expectations of urban landscapes: buildings outnumber vehicles, and together those two classes dominate the remaining 60+ classes. Creating a balanced dataset across just 10 object classes here involved fusing multiple sources, then rescaling, cropping, or chipping to the desired object centred in a grayscale thumbnail. Another design consideration stems from the lack of realism in growing Overhead-MNIST with more non-trivial examples. For instance, using quarter rotations of each image and then brightening them by 10% would generate a count comparable to the 70,000+ examples in MNIST. However, the entire globe might never offer that many stadiums (4,827 currently).
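The quarter-rotation and brightening augmentation mentioned above can be sketched with plain nested lists. This is a hedged illustration: it assumes "brightening by 10%" means multiplying each pixel by 1.10 and clipping at 255, which the text does not specify, and the helper names and the tiny 2x2 tile are hypothetical stand-ins for a 28x28 thumbnail.

```python
def rotate_quarter(image):
    """Rotate a square pixel grid 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def brighten(image, factor=1.10):
    """Scale every pixel by `factor`, clipping at the 8-bit maximum of 255.

    Assumption: a multiplicative 10% brightening."""
    return [[min(255, round(p * factor)) for p in row] for row in image]


def augment(image):
    """Return 8 variants: 4 quarter rotations, each plain and brightened."""
    variants = []
    current = image
    for _ in range(4):
        variants.append(current)
        variants.append(brighten(current))
        current = rotate_quarter(current)
    return variants


# Tiny 2x2 tile standing in for a 28x28 thumbnail.
tile = [[1, 2],
        [3, 4]]
variants = augment(tile)
```

Each source thumbnail yields eight variants under this scheme, which is how a set of roughly 10,000 images can be grown toward the size of the original MNIST without collecting new scenes.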
For large, multi-class benchmarks, previous work [33-34] has isolated mislabelled or very hard samples. Crowd-sourcing the labelling (via Amazon Mechanical Turk or other platforms) highlights the differences in human opinions or, alternatively, the diversity of data collection needed to establish ground truth. This challenge appears even in carefully curated datasets like Fashion-MNIST, where real physical poses and photographers define the object classes and ontology from the Zalando catalogue. Given that the top 5 algorithms in a state-of-the-art (SOTA) list differ by a fraction of a percent [35], we explore the very hard samples shown in Figure 7. For 1000 random samples, the human labelling identified 7 tough samples (0.7%), either because the images contained a human model, were not front facing, were outside the 10 categories, or combined two categories (shirt and trousers). The point of this exercise is less to nit-pick a tough labelling challenge than to note that some of this ontology noise is expected but may bear on the fine-tuning of certain algorithmic comparisons that follow, particularly for highly popular benchmarks [17] like the MS-COCO, ImageNet, CIFAR-10 [22] and MNIST [1] families.
Figure 7. Very hard or mislabelled thumbnails in Fashion-MNIST. A. T-shirt with person. B. Coat with person. C. Dress with person. D. Dress with two parts. E. Dress with bloomers. F. T-shirts.
DISCUSSION AND CONCLUSIONS
The Overhead-MNIST dataset covers satellite imagery of 10 frequently seen object classes. The format corresponds closely to traditional digit recognition benchmarks, including images (compressed JPEG), CSV and binary (idx-ubyte). With pre-trained convolutional neural networks, transfer learning achieves 96.4% accuracy for multi-class identification. A more complete algorithm survey will follow as a companion to the benchmark dataset and preparation described here. With a future goal of better on-board processing, companion algorithmic studies will leverage the vast MNIST literature to shrink classifier models so they run on newer and smaller hardware.
ACKNOWLEDGEMENTS
The authors would like to thank the PeopleTec Technical Fellows program for encouragement and project assistance.
REFERENCES
[1] LeCun, Yann, Corinna Cortes, and C. J. Burges. "MNIST handwritten digit database." (2010): 18. http://yann.lecun.com/exdb/mnist/ and Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
[2] Cohen, Gregory, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. "EMNIST: Extending MNIST to handwritten letters." pp. 2921-2926. IEEE, 2017.
[3] CV Online (accessed 01/2021), http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm
[4] Google Scholar search (accessed 01/2021), https://scholar.google.com/scholar?q=mnist and https://trends.google.com/trends/explore?date=all&q=mnist,ImageNet,%2Fg%2F11gfhw_78y
[5] Chen, Li, Song Wang, Wei Fan, Jun Sun, and Satoshi Naoi. "Beyond human recognition: A CNN-based framework for handwritten character recognition." pp. 695-699. IEEE, 2015.
[6] Image Classification on MNIST (accessed 01/2021), https://paperswithcode.com/sota/image-classification-on-mnist
[7] Grim, Jiří, and Petr Somol. "A Statistical Review of the MNIST Benchmark Data Problem." http://library.utia.cas.cz/separaty/2018/RO/grim-0497831.pdf
[8] Schott, Lukas, Jonas Rauber, Matthias Bethge, and Wieland Brendel. "Towards the first adversarially robust neural network model on MNIST." arXiv preprint arXiv:1805.09190 (2018).
[9] Cheng, Keyang, Rabia Tahir, Lubamba Kasangu Eric, and Maozhen Li. "An analysis of generative adversarial networks and variants for image synthesis on MNIST dataset." Multimedia Tools and Applications
arXiv preprint arXiv:1908.01242
arXiv preprint arXiv:1708.07747
[16] "Convert-own-data-to-MNIST-format" (accessed 01/2021), https://github.com/Arlen0615/Convert-own-data-to-MNIST-format
[17] Rieke, Christoff. Awesome Satellite Imagery Datasets (accessed 01/2021), https://github.com/chrieke/awesome-satellite-imagery-datasets
[18] Noever, David, Wes Regian, Matt Ciolino, Josh Kalin, Dom Hambrick, and Kaye Blankenship. "Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures." arXiv preprint arXiv:2001.05839 (2020).
[19] Noever, David, Wes Regian, Matt Ciolino, Josh Kalin, and Dom Hambrick. "Novel Scoring with Confusion Matrices for Satellite Image Captioning." 2020 Southern Data Science Conference, August 12-14, 2020, Atlanta, GA.
[20] Lam, Darius, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. "xView: Objects in context in overhead imagery." arXiv preprint arXiv:1802.07856 (2018).
[21] WorldView-3 - Satellite Missions - eoPortal Directory - ESA (accessed 01/2021), https://earth.esa.int/web/eoportal/satellite-missions/v-w-x-y-z/worldview-3
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510-4520. 2018.
[25] UC Merced Land Use Dataset (accessed 01/2021), http://weegee.vision.ucmerced.edu/datasets/landuse.html
[26] Xia, Gui-Song, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. "DOTA: A large-scale dataset for object detection in aerial images." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974-3983. 2018.
[27] Van Etten, Adam, Dave Lindenbaum, and Todd M. Bacastow. "SpaceNet: A remote sensing dataset and challenge series." arXiv preprint arXiv:1807.01232 (2018).
[28] Salehi, Sohail.