Mars Image Content Classification: Three Years of NASA Deployment and Recent Advances
Kiri Wagstaff, Steven Lu, Emily Dunkel, Kevin Grimes, Brandon Zhao, Jesse Cai, Shoshanna B. Cole, Gary Doran, Raymond Francis, Jake Lee, Lukas Mandrake
Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA
Duke University, 2138 Campus Drive, Durham, NC 27708, USA
California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
Space Science Institute, 4765 Walnut St, Suite B, Boulder, CO 80301, USA
{kiri.l.wagstaff, you.lu, emily.dunkel, kevin.m.grimes}@jpl.nasa.gov, [email protected], [email protected], [email protected], {gary.b.doran.jr, raymond.francis, jake.h.lee, lukas.mandrake}@jpl.nasa.gov

Abstract
The NASA Planetary Data System hosts millions of images acquired from the planet Mars. To help users quickly find images of interest, we have developed and deployed content-based classification and search capabilities for Mars orbital and surface images. The deployed systems are publicly accessible through the PDS Image Atlas. We describe the process of training, evaluating, calibrating, and deploying updates to two CNN classifiers for images collected by Mars missions. We also report on three years of deployment, including usage statistics, lessons learned, and plans for the future.
Introduction
The NASA Planetary Data System (PDS) maintains archives of data collected by NASA missions that explore our solar system. The PDS Cartography and Imaging Sciences Node (Imaging Node) provides access to millions of images of planets, moons, comets, and other bodies. Given the large and continually growing volume of data, there is a need for tools that enable users to quickly search for images of interest. Each image product is described by a rich set of searchable metadata properties, such as the time it was collected, the instrument used, the image target, and the local season. However, users often wish to search on the content of the image to zero in on those images most relevant to a scientific investigation or individual curiosity. Manually searching through millions of images is infeasible. In previous work, we trained image classifiers to detect classes of interest in Mars orbital and surface images (Wagstaff et al. 2018). Using the predictions made by these classifiers, users can interactively search for classes of interest using the PDS Image Atlas (https://pds-imaging.jpl.nasa.gov/search/). Since the deployment of these classifiers in late 2016 and through August 2020, their predictions have been used to satisfy user searches on the Atlas website.

In this paper, we report on several new advances within this domain. First, we expanded the set of classes known to each classifier to broaden their coverage of different content types. Second, we employed classifier calibration to produce more reliable posterior probabilities, which is vital since only classifications with a posterior probability of at least 0.9 are displayed to users. Finally, we report on three years of deployment, including usage statistics, lessons learned, and plans for the future.

Related Work
Machine learning image classification has achieved high levels of performance since the adoption of convolutional neural networks (CNNs) trained on millions of images (Krizhevsky, Sutskever, and Hinton 2012). In addition to demonstrated improvements in accuracy, the use of a CNN removes the need for manual feature engineering. The ability to adapt or "fine-tune" large networks enables the re-use of the learned lower levels of the network on new image collections while customizing the output nodes to the classes of interest. Palafox et al. (2017) showed that a CNN out-performed a support vector machine classifier at finding Mars landforms of interest. In previous work, we demonstrated the ability to fine-tune the AlexNet classifier for application to images collected by instruments in Mars orbit and on the Mars surface (Wagstaff et al. 2018). Other approaches with relevance for planetary exploration are terrain classification of regions within an image to inform navigation (Rothrock et al. 2016) and generating text captions for planetary images to enable a larger search vocabulary (Qiu et al. 2020), as opposed to a fixed set of image classes.
New Mars Classifier Data Sets
We created two new labeled data sets to train and evaluate the latest versions of our Mars image classifiers. The HiRISE images were collected by the High Resolution Imaging Science Experiment (HiRISE) instrument onboard the Mars Reconnaissance Orbiter (MRO) (McEwen et al. 2007), while the MSL images were collected by the Mast Camera (Mastcam) and Mars Hand Lens Imager (MAHLI) instruments mounted on the Mars Science Laboratory (MSL) Curiosity rover (Grotzinger et al. 2012). To ensure high quality, the labels for both data sets were acquired using crowdsourcing with local volunteers who received specific training for each data set.

Class Name      Count    Percent
Bright dune       250      2.31%
Crater            794      7.34%
Dark dune         166      1.53%
Impact ejecta      74      0.68%
Other           8,802     81.39%
Slope streak      267      2.47%
Spider            164      1.52%
Swiss cheese      298      2.76%
Total          10,815       100%

Table 1: HiRISE (Mars orbital) data set class distribution.
HiRISE Orbital Data Set (v3)
In previous work (Wagstaff et al. 2018), we compiled images of Mars surface features that covered five classes of interest. The new HiRISE data set (v3) increases the number of labeled images to 10,815 (before augmentation), with eight classes (Doran et al. 2020).

HiRISE images consist of long, multi-kilometer strips imaged at a resolution of tens of centimeters per pixel. To identify surface features of interest, we employed a focus of attention mechanism known as dynamic landmarking (Wagstaff et al. 2012). This process scans through a large image to identify visually salient regions, which are termed landmarks. The salience of each pixel is defined as a linear combination of the response to a Canny filter and the Earth mover's distance (Rubner, Tomasi, and Guibas 1998) between the distribution of pixel intensity values within a window around the pixel and the values within a larger enclosing window. We employed a genetic algorithm to optimize the parameters (analysis window sizes, weighting of individual filters, and salience threshold) based on fourteen HiRISE images with hand-labeled salient regions. The salient landmarks within an image were obtained by identifying connected components of regions that exceed the salience threshold. We cropped the salient landmarks from the "browse" (reduced resolution) version of each HiRISE image using a square bounding box plus a small pixel border, then resized each crop to a fixed square size.

The resulting HiRISE image data set contains 10,815 landmark images derived from separate HiRISE source images. The class distribution is shown in Table 1 in alphabetical order. The classes are highly imbalanced: the majority of images are classified as "Other", and "Impact ejecta" is the least common class.

Examples of each class are shown in Figure 1. The "Bright dune", "Crater", "Dark dune", "Other", and "Slope streak" classes were included in the v1 HiRISE data set. "Bright dune" and "Dark dune" are two sand dune classes found on Mars.
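The salience measure used by dynamic landmarking can be sketched as follows. This is a toy illustration, not the deployed code: a plain gradient magnitude stands in for the Canny response, and the window sizes and weights (which the paper tunes with a genetic algorithm) are arbitrary placeholders here.

```python
import numpy as np

def emd_1d(p, q):
    """Earth mover's distance between two 1-D histograms with equal bins
    (sum of absolute differences of the cumulative distributions)."""
    p = p / p.sum()
    q = q / q.sum()
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def salience_map(img, inner=8, outer=24, w_grad=0.5, w_emd=0.5):
    """Toy per-pixel salience: a weighted sum of normalized gradient
    magnitude (stand-in for the Canny response) and the EMD between the
    intensity histogram of a small window and its enclosing window."""
    gy, gx = np.gradient(img.astype(float))
    grad = np.hypot(gx, gy)
    grad /= grad.max() + 1e-9
    sal = np.zeros_like(grad)
    h, w = img.shape
    for i in range(outer, h - outer):
        for j in range(outer, w - outer):
            win = img[i - inner:i + inner, j - inner:j + inner]
            ctx = img[i - outer:i + outer, j - outer:j + outer]
            hp, _ = np.histogram(win, bins=16, range=(0, 255))
            hq, _ = np.histogram(ctx, bins=16, range=(0, 255))
            # +1 smoothing avoids division by zero on empty histograms
            sal[i, j] = w_grad * grad[i, j] + w_emd * emd_1d(hp + 1.0, hq + 1.0)
    return sal
```

Landmarks would then be the connected components of pixels exceeding a salience threshold.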
Dark dunes are completely defrosted, whereas bright dunes are not. The "Crater" class consists of crater images in which the diameter of the crater is greater than or equal to 1/5 the width of the image and the circular rim is visible for at least half the crater's circumference. The "Slope streak" class consists of images of dark flow-like features on slopes. These features are believed to be formed by a dry process in which overlying (bright) dust slides down a slope and reveals a darker sub-surface. "Other" is a catch-all class that contains images that fit none of the defined classes of interest (e.g., Figure 1(e)). The data set is archived at https://doi.org/10.5281/zenodo.4002935.

Figure 1: Examples of each class in the HiRISE v3 data set: (a) Bright dune, (b) Crater, (c) Dark dune, (d) Impact ejecta, (e) Other, (f) Slope streak, (g) Spider, (h) Swiss cheese.

We introduce three new classes of interest: "Impact ejecta", "Spider", and "Swiss cheese". "Impact ejecta" refers to evidence of a meteorite impact on the surface. "Spiders" and "Swiss cheese" are phenomena that occur in the south polar region of Mars. Spiders have a central pit with radial troughs, and they are believed to form as a result of seasonal jets expelling carbon dioxide gas through an overlying ice layer (Aye et al. 2019). "Swiss cheese" is terrain that consists of pits formed when the sun heats the ice, making it sublimate (change from solid to gas).

We used a combination of labeling platforms to label the HiRISE landmark images. Early images were labeled by in-house volunteers using the Zooniverse.org platform. We conducted a second labeling campaign that targeted three minority classes: Impact ejecta, Spiders, and Swiss cheese. Landmark images from this campaign were labeled using the Interactive Data Analyzer and Reviewer (IDAR) browser-based image labeling tool (https://github.com/stevenlujpl/IDAR). We obtained labels for each image from three volunteers. Images for which the three labels did not agree (for the second campaign, approximately 30% of the images) were manually reviewed to select the final label. To guide labeling when more than one class was present in the image, we instructed volunteers to prioritize classes in the order Impact ejecta, Slope streak, Spider, Dark dune, Bright dune, Swiss cheese, Crater, Other.

MSL Surface Data Set (v2)
We created a new data set of Mars surface images collected by the Mastcam and MAHLI instruments on the MSL Curiosity rover. Mastcam is a two-instrument suite with left- and right-eye cameras. MAHLI is a single focusable camera located on the turret at the end of the rover's robotic arm.

Figure 2: Examples of each class in the MSL surface data set: (a) Arm cover, (b) Artifact, (c) Close-up rock, (d) Distant landscape, (e) Drill hole, (f) DRT, (g) DRT spot, (h) Float rock, (i) Layered rock, (j) Light-toned veins, (k) Mastcam calibration target, (l) Nearby surface, (m) Night sky, (n) Other rover part, (o) Sand, (p) Sun, (q) Wheel, (r) Wheel joint, (s) Wheel tracks.

In our previous work (Wagstaff et al. 2018), we created a data set whose classes primarily focused on rover hardware parts. The new data set (v2) includes images spanning 19 classes that primarily focus on objects of scientific interest (Lu and Wagstaff 2020). The MSL data set consists of RGB and grayscale images that are 8-bit, decompressed, radiometrically calibrated, color corrected, and geometrically linearized browse images from the MSL mission archive hosted by the PDS Imaging Node. We resized the smaller side of each image to a fixed length while preserving the aspect ratio, then center-cropped the longer side to produce square images. We randomly sampled images from sol (MSL mission day) 1 to 2224, composed of Mastcam left-eye camera, Mastcam right-eye camera, and
MAHLI images. The data set is archived at https://doi.org/10.5281/zenodo.4033453.

We performed class discovery by labeling images covering the full sol range with the browser-based Class Discovery Tool from the IDAR software suite. This tool allows users to interactively associate images with dynamically created categories as they are discovered. We started with an initial list of classes from a domain expert on the MSL science team and pre-sorted the images using the DEMUD algorithm (Wagstaff et al. 2013) so that the most "interesting" or unusual images were displayed first. The DEMUD algorithm is efficient in terms of class discovery (Wagstaff and Lu 2020). The class discovery process yielded 19 classes of interest.

Examples of each class are shown in Figure 2 in alphabetical order. They include three classes that describe rover-created features ("Drill hole", "DRT (Dust Removal Tool) spot", and "Wheel tracks"), "Sun" and "Night sky", seven Mars surface feature classes (e.g., "Light-toned veins", "Layered rock", "Float rock"), five rover part classes (e.g., "DRT", "Mastcam calibration target", "Wheel", and a generic "Other rover part" class), and "Artifact", used for images that are low in quality or contain missing data. The pixel resolutions and lighting conditions in these images vary widely because the targets were imaged at different distances and different times.

We used the IDAR labeling tool to label the MSL surface data set images. We divided the images into batches, and each batch was distributed to three volunteers for labeling. We provided detailed labeling instructions with definitions of each class and prioritization guidance for cases in which multiple classes appeared in a single image. As with the HiRISE data set, we resolved disagreements using expert review. The MSL surface data set class distribution is shown in Table 2.

CNN Classification for Mars Images
We trained and deployed two Convolutional Neural Network (CNN) classifiers, denoted HiRISENet and MSLNet, for MRO HiRISE images and for MSL Mastcam and MAHLI images. We employed transfer learning to adapt the weights of networks pre-trained on Earth images for use with Mars orbital and surface images.
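The fine-tuning recipe used by both classifiers (nearly frozen early layers, a fast-adapting replacement output layer) amounts to per-layer learning-rate multipliers in the SGD update. A minimal numpy sketch of that update (the function name, toy shapes, and multiplier values here are illustrative, not the deployed Caffe solver):

```python
import numpy as np

def sgd_step(params, grads, base_lr=1e-4, multipliers=None):
    """Toy SGD update with per-layer learning-rate multipliers: early
    layers get small multipliers (nearly frozen) while the new final
    layer receives a large multiplier and adapts quickly."""
    if multipliers is None:
        multipliers = [1.0] * len(params)
    return [p - base_lr * m * g
            for p, m, g in zip(params, multipliers, grads)]
```

With multipliers like [1, 1, 10], the final layer moves ten times farther per step than the early layers for the same gradient magnitude.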
HiRISENet: CNN Classifier for Mars Orbital Images
We adapted the AlexNet image classifier (Krizhevsky, Sutskever, and Hinton 2012) for use with the HiRISE classes. AlexNet was trained on 1.2 million Earth images from 1000 classes in the ImageNet data set. We started with Caffe's BVLC reference model (Jia et al. 2014), a replication of AlexNet provided by Jeff Donahue (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet). We removed the final fully connected layer, added a new layer with eight output classes, and re-trained the network with Caffe (Jia et al. 2014). We followed Caffe's recommendations for fine-tuning, including using a small base learning rate, a small step size, and a larger learning rate multiplier for the final layer only. We used a learning rate of 0.0001, a weight decay of 0.0005, and a relatively small step size. The initial layers were almost fixed; they used learning rate multipliers of 1 (weight) and 2 (bias). The final layer was allowed greater adaptation, with multipliers of 10 (weight) and 20 (bias). Caffe computes the per-band mean pixel values from the training set and uses these values to normalize all images during training and prediction.

We split the HiRISE data set into train, validation, and test sets using each landmark's HiRISE source image identifier to ensure no overlap in source images between the sets. We used approximately 65% of the data for training, 19% for validation, and 17% for testing. Images obtained from our second labeling campaign (which targeted minority classes) appear in the training and validation sets only, so that we could assess improvements against an unchanged test set.

We applied data augmentation to the training and validation sets. The augmentation includes three rotations (90, 180, and 270 degrees), horizontal and vertical flips, and a random brightness adjustment. In addition, we up-sampled data obtained in the second labeling campaign by a factor of two.

MSLNet: CNN Classifier for Mars Surface Images
MSLNet is a hybrid of two CNN classifiers. The version 1 (v1) classifier focused on rover hardware classes (Wagstaff et al. 2018; Lu et al. 2019). The primary objectives of the Mastcam and MAHLI instruments are to enable science analysis of rover investigation sites, which motivated the creation of the version 2 (v2) classifier to expand the set of classes to include science targets (e.g., "Float rock", "Layered rock") and activities (e.g., "DRT spot", "Drill hole"). The v1 classifier initially focused on engineering considerations and rover hardware classes due to requests by the MSL rover planning team as well as the pre-existing availability of labels for those items. The "Wheel" class was of particular interest due to growing awareness in 2017 that the rover's wheels were experiencing a higher than expected level of degradation from driving on the rough surface. The success of the v1 classifier led to new requests to also accommodate science-related classes in support of the MSL mission to explore and understand Mars. Observations that contain classes such as "Layered rock" and "Light-toned veins" are very high science priorities to help determine the history and evolution of water activity, which can also have implications for habitability. The deployed MSLNet classifier covers both areas of interest (engineering and science) to meet the needs of diverse users with different priorities.

MSLNet first classifies images with the v2 classifier, then reclassifies any images classified as "Other rover part" with the v1 classifier to obtain a fine-grained classification of rover parts. The creation and evaluation of the v1 classifier were reported in previous work (Wagstaff et al. 2018). The v2 classifier was trained and evaluated using the MSL surface data set described above.
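The two-stage cascade described above can be sketched in a few lines; here `v1_predict` and `v2_predict` are hypothetical stand-ins for the two trained models, each returning a (label, confidence) pair:

```python
def mslnet_predict(image, v2_predict, v1_predict):
    """Hybrid MSLNet sketch: classify with the science-focused v2 model
    first, then refine any generic 'Other rover part' detection with the
    hardware-focused v1 model."""
    label, conf = v2_predict(image)
    if label == "Other rover part":
        label, conf = v1_predict(image)  # fine-grained rover-part class
    return label, conf
```

Because the v1 model is only consulted for images the v2 model assigns to its generic rover-part class, science predictions pass through untouched.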
We divided this data set into training, validation, and test sets according to sol of acquisition, to enable the evaluation of generalization performance on newly acquired images. The sol splitting boundaries, as shown in Table 3, were chosen so that the per-camera distributions roughly match the full archive.

To improve the generalization performance of the classifier, we augmented the images in the training data set (but not the validation and test data sets). The MAHLI images (which come from a rotatable platform) were augmented using rotation (90°, 180°, and 270°) and flipping (horizontal and vertical); the Mastcam images (which come from a fixed platform) were augmented using only horizontal and vertical flipping.

As with HiRISENet, for the MSLNet v2 classifier we fine-tuned AlexNet with a fixed base learning rate. We set the learning rate multipliers of the first four convolution layers, the fifth convolution layer, and the final fully connected layers to 0, 1, and 20, respectively. We used dropout for the first and second fully connected layers. The final hybrid MSLNet classifier combines v1 and v2 and classifies images into classes of both science and engineering relevance.

Classifier Calibration
The deployed classifiers use a confidence threshold to determine which results are shown to users, so it is vital that the models are well calibrated. Modern neural networks have achieved higher accuracies but in many cases have suffered an increase in calibration error, which means that the predicted class probabilities deviate from the true empirical probabilities. In many cases, the networks are consistently over-confident in their predictions. This effect appears to be tied to an increase in model capacity and a lack of regularization (Guo et al. 2017). For a quantitative measure of model calibration, we calculate the Expected Calibration Error (ECE), which is the expected difference between posterior probability (confidence) and accuracy. We partition n predictions into M equally spaced bins and compute a population-weighted average of the difference between accuracy and confidence within each bin:

ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.

Table 3: MSL data set sol ranges: train, sol 1-948; validation, sol 949-1920; test, sol 1921-2224; full archive, sol 1-2224.

Table 4: Classification accuracy on HiRISE (Mars orbital) images: train (n = 6997; 51,058 after augmentation), validation (n = 2025; 14,959 after augmentation), test (n = 1793). The most-common-class baseline achieves 78.4% (train), 75.0% (validation), and 81.1% (test).

We evaluated four post-hoc calibration methods that extend Platt scaling (Platt et al. 1999) to multiclass problems. Temperature scaling (Guo et al. 2017) uses a single parameter T to rescale the model output. Given the model output for item x, which is a logit vector z \in \mathbb{R}^K, the calibrated probability of class k is

p(y = k \mid x) = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}.

The parameter T is optimized with respect to the log likelihood on the validation set. Since T does not change the maximum of the softmax function, the accuracy of the model is unchanged. Bias-corrected temperature scaling (BCTS) (Alexandari, Kundaje, and Shrikumar 2020) adds a bias term b_k for each class:

p(y = k \mid x) = \frac{e^{z_k / T + b_k}}{\sum_{j=1}^{K} e^{z_j / T + b_j}}.

Vector and matrix scaling (Guo et al. 2017) add additional flexibility with per-class scaling, using a K x K linear transformation matrix W to compute Wz + b and then normalizing across classes to get p(y = k | x). Vector scaling constrains W to be a diagonal matrix whose entries function as class-specific temperature values.

CNN Classification Evaluation
To evaluate HiRISENet and MSLNet, we used the overall (post-threshold) accuracy score and abstention rate as the primary performance metrics, and ECE to measure the calibration level of the classifiers. We also analyzed the precision and recall scores to understand per-class performance.
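These metrics can be computed directly from per-image confidences and correctness indicators. A minimal numpy sketch (the function names and the 10-bin choice are ours; a temperature-scaled softmax is included to show how the calibration step above fits in):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax: higher T flattens the probabilities
    without changing the argmax (so accuracy is unchanged)."""
    logits = np.asarray(z, dtype=float) / T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error: population-weighted gap between
    per-bin accuracy and per-bin mean confidence."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = len(conf)
    err = 0.0
    for m in range(n_bins):
        mask = bins == m
        if mask.any():
            err += mask.sum() / total * abs(correct[mask].mean() - conf[mask].mean())
    return err

def thresholded_accuracy(conf, correct, tau=0.9):
    """Post-threshold accuracy over predictions with confidence >= tau,
    plus the abstention rate (fraction withheld from users)."""
    keep = conf >= tau
    acc = correct[keep].mean() if keep.any() else float("nan")
    return acc, 1.0 - keep.mean()
```

Raising the threshold tau trades coverage (more abstention) for reliability of the classifications that are shown.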
HiRISENet Evaluation
HiRISENet classification accuracy results are shown in Table 4. Random class prediction on this data set achieves 11.1% accuracy (given eight classes). Compared to a simple baseline that predicts the most common class from the training set ("Other"), HiRISENet exhibits a strong improvement from 81.1% to 92.8% on the test set.

Method               ECE     Acc    Acc (0.9 confidence)   Abst. Rate
Uncalibrated         0.073   88.6   94.2
Temperature scaling  0.013   88.6   97.3                   31%
BCTS                 0.014   89.2   97.3                   29%

Matrix scaling achieved the highest post-threshold accuracy as well as the lowest abstention rate. Vector scaling achieved the lowest ECE, but with higher abstention and lower accuracy. Therefore, we adopted matrix scaling for deployment. On the test set, the calibrated HiRISENet model achieved high accuracy at the 0.9 confidence threshold with a modest abstention rate.

Reliability diagrams (DeGroot and Fienberg 1983; Niculescu-Mizil and Caruana 2005) provide a visual representation of model calibration. The empirical per-bin accuracy is plotted as a function of model posterior probability; for a perfectly calibrated model, these values are equal, following the diagonal line. Figure 3 shows the reliability diagram for HiRISENet. The bottom panel shows the data set distribution in terms of predicted probability. We find that HiRISENet is well calibrated, with a low ECE and the majority of predictions in the most-confident bin.

Figure 4: Calibrated HiRISENet per-class precision and recall (test set).

Figure 4 shows per-class precision and recall on the test set after matrix scaling calibration. Most classes achieve high precision and recall, even after thresholding. The "Spider" class has the lowest recall (evaluated on only 42 test items), while the "Impact ejecta" class has the lowest precision. Figure 5 shows the confusion matrix on the test set after matrix scaling calibration. Diagonal (correct) entries have a dark background. A comparison to the confusion matrix before calibration (not shown) indicates that two images that were previously incorrectly classified into the "Spider" class are now correctly classified as the "Impact ejecta" class; however, the confusion between the "Crater" and "Impact ejecta" classes has increased. In addition, the "Spider" class suffered from significant domain shift, which is evident in Figure 6.
The "Spider" images in the validation set, shown in Figure 6(a), are extremely visually different from the "Spider" images in the test set, shown in Figure 6(b). We found that even human labelers had trouble recognizing them as the same phenomenon. Future updates to this data set will target the "Spider" class.

MSLNet Evaluation
The performance of the MSLNet v2 classifier is shown in Table 6 in comparison to the most-common-class ("Nearby surface") baseline. The MSLNet classifier significantly outperforms the baseline method on the test set, both overall and when abstaining below the 0.9 confidence threshold. Recall that images in the training, validation, and test sets were divided according to their sol (date) of acquisition.

Figure 5: Calibrated HiRISENet confusion matrix (test set). Test-set class counts: other (1482), crater (89), dark dune (66), slope streak (49), bright dune (16), impact ejecta (7), swiss cheese (42), spider (42).

Figure 6: Domain shift in HiRISE "Spider" landmarks: (a) validation set, (b) test set.

Table 6: Classification accuracy on MSL (surface) images: train (n = 5920), validation (n = 300), test (n = 600). The most-common-class baseline achieves 26.3% (train), 24.7% (validation), and 31.2% (test).

The performance of the classifier gradually decreases over time as the rover traverses to new locations, possibly due to label shift, in which the prior class probabilities change over space or time (Lipton, Wang, and Smola 2018). We plan to investigate label shift adaptation to enable the classifier to accommodate such change.

MSLNet achieves lower accuracy and higher abstention than HiRISENet on its corresponding test set. Given the larger number of classes and smaller number of labeled images, we believe that this classifier is likely even more data-limited and would benefit from additional data collection.

Reliable posterior probabilities are likewise essential for MSLNet. We calibrated the MSLNet classifier using temperature scaling, the most computationally efficient method among the four calibration methods discussed in this paper (e.g., matrix scaling scales quadratically with the number of classes, which is problematic for MSLNet). After calibration, test set accuracy using the confidence threshold improved, at the cost of increased abstention. For this application, we are willing to sacrifice coverage to ensure that the classifications provided to users are highly reliable. MSLNet's ECE also improved with temperature scaling. The reliability diagram of MSLNet after calibration is shown in Figure 7.

Figure 8 shows per-class precision and recall on the test set. The first group includes classes whose F1 scores exceed roughly 0.6; the second (intermediate) group includes five classes (e.g., "Layered rock", "Drill hole") whose F1 scores fall between roughly 0.2 and 0.6; and the third group includes four classes (e.g., "Float rock", "Wheel tracks") whose F1 scores are below roughly 0.2. We note that the classes in the third group were evaluated on very few images, so their performance is not statistically robust. These classes require further improvement, and we plan to investigate up-sampling or additional data acquisition to increase the number of images of these classes.

PDS Image Atlas Deployment
HiRISENet and MSLNet generate classifications that enable PDS users to quickly find images of interest via content-based search. The public interface to the PDS image archives is the PDS Image Atlas (https://pds-imaging.jpl.nasa.gov/search/).

Figure 8: Calibrated MSLNet per-class precision and recall (test set).

Figure 9: View from the Atlas of craters found in HiRISE image PSP_002912_2075_RED.

Figure 10: Number of monthly queries for HiRISENet and MSLNet classifications over 3.5 years (colors distinguish years): (a) HiRISENet queries, (b) MSLNet queries.

Users can select an instrument (e.g., HiRISE) and filter the images to only contain a particular class of interest. To enable this kind of search, we applied HiRISENet and MSLNet to the full archive of images collected by the relevant instruments on Mars. These archives contain far more images than the labeled subsets used for training and evaluation. Figure 9 shows the Atlas user view of a HiRISE image with all confidently classified craters enclosed in red bounding boxes. Craters that are small, faint, degraded, or distorted are less likely to pass the confidence threshold, but those identified as craters are highly reliable. In response to user requests, we added the ability to download a file that contains the latitude and longitude coordinates of each detected landmark, using PDSC (https://github.com/JPLMLIA/pdsc) to convert from pixel to geographic coordinates.

Classifying all HiRISE images took about five days on a GPU system and yielded landmarks with a posterior probability of at least 0.9 from classes other than "Other". We also filtered out predictions for "Spider" or "Swiss cheese" at latitudes outside of the south polar region (Aye et al. 2019). This total represents an 81% increase over the number of classified landmarks available in the first classifier release (Wagstaff et al. 2018). MSLNet classified images with a posterior probability of at least 0.9. Both classifiers are integrated into the data ingestion pipeline for the Atlas. As new data is delivered from HiRISE or MSL, the images are automatically processed and tagged with their predicted classes.

We track the number of Atlas queries that make use of HiRISENet and MSLNet classifications. As shown in Figure 10(a), HiRISENet exhibits regular and increasing usage over time. The most popular HiRISE class to be queried is "Crater". MSLNet shows more varied activity (Figure 10(b)), dominated by heavy usage in early 2019, when the number of queries for "Wheel" increased by several orders of magnitude (note the difference in y-axis range). Given the small separation in query timestamps, this was most likely the result of a large number of scripted queries to the Atlas, which provides a public API. It is possible that this classifier output is serving to help train other investigators' models.

We also analyzed the top 20 domain names from which the queries came. From January to July of 2020, we found that 40% of these queries came from hosts through an ISP, including Spectrum and Comcast as well as ISPs in the U.K., the Netherlands, Romania, and Taiwan. Another 33% of these queries came from JPL domains, which is not surprising given the relevance of the content to JPL projects. A minority of the top 20 domain queries came from the Remote Sensing Technology Center of Japan (2%), the University of Wyoming (1%), and SoftBank (Japan) (1%), while 23% of hostnames did not resolve to a domain.

Finally, we used Google Analytics to examine the global distribution of visitors who made classification queries between July 2017 (the oldest data available) and August 2020. The top ten countries by share of visitors were: United States (51%), India (6%), United Kingdom (4%), Germany (3%), Canada (2%), France (2%), Italy (2%), Spain (2%), Australia (2%), and Russia (2%). In all, visitors came from 180 different countries.

Lessons learned.
The deployment of Mars image classi-fiers at scale has yielded several lessons learned. First, it isworth highlighting the challenge of defining meaningful andrelevant classes up front. Unlike a fixed benchmark data set,new Mars images are continually collected and new classescould arise at any time. Collaboration with domain expertsis vital for ensuring the relevance of the class definitions.Second, domain shift is an important factor in both datasets. Figure 11 compares the class distribution (excluding“Other”) for the labeled HiRISE data set (brown) to thepredictions made across the full HiRISE archive (orange). r i g h t d u n e c r a t e r d a r k d u n e i m p a c t e j e c t a s l o p e s t r e a k s p i d e r s w i s s c h e e s e F r a c t i o n o f i m a g e s LabeledDeployed
Figure 11: Class distribution for the HiRISE labeled data set versus predictions on the full archive.

While the "Crater" class is dominant in both, its share of the images classified nearly doubles when deployed. There are likely two factors involved: our labeled data set is not fully representative of all of Mars, and "Crater" predictions may in general be more confident and thus more likely to pass the 0.9 threshold and be retained here. Similar effects are seen in the MSL data set. We are currently investigating the use of label shift adaptation to increase the robustness of both classifiers.

Finally, we found that our initial simplifying assumption of one class per image is sometimes inadequate (a crater might contain a dark slope streak; an MSL image might contain rover parts and the horizon). Even with guidance on class priorities, human labelers sometimes found it difficult to select a single class label. We plan to adopt a multi-label approach in future versions to allow more flexibility and reduce label noise.
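The difference between the two settings can be sketched in a few lines: a multi-label head scores each class with an independent sigmoid rather than a single competing softmax, so one image can legitimately receive both "Wheel" and "Horizon" (or no label at all). The class names, logits, and 0.5 decision threshold below are illustrative, not the deployed configuration.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def multi_label_predict(logits, class_names, threshold=0.5):
    """Independent per-class decisions: an image may receive zero, one,
    or several labels, unlike argmax over a softmax."""
    probs = sigmoid(logits)
    return [name for name, p in zip(class_names, probs) if p >= threshold]

classes = ["crater", "dark slope streak", "wheel", "horizon"]
# Hypothetical logits for an MSL image showing both rover parts and the horizon
print(multi_label_predict([-3.1, -2.5, 2.2, 1.7], classes))  # ['wheel', 'horizon']
```

With a softmax head, the same image would be forced into exactly one of the two correct classes, which is the source of the label noise described above.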
Conclusions and Future Work
This paper presents the latest updates to the deployment of machine learning image classifiers in support of planetary science. Two classifiers, for orbital and surface images of Mars, have been operating since late 2016 to enable the first content-based search of large NASA image archives. Usage has increased over time, and feedback from users as well as internal assessments have guided recent improvements. These include acquiring additional training data to improve minority-class performance for the HiRISE classifier, defining new classes of scientific interest for the MSL classifier, employing calibration to increase classifier reliability, and making landmark coordinates downloadable. To increase our understanding of how and why users employ the machine learning classifications of Mars images, we are investigating minimally intrusive ways to learn more about user motivations and use cases. Meanwhile, the performance and scope of these classifiers continue to grow. Each time new images are delivered by the instruments at Mars, they are automatically classified and added to the archive.

We are currently developing a new classifier, MERNet, that will operate on images collected by the two Panoramic Camera (Pancam) instruments on the Opportunity and Spirit Mars Exploration Rovers (MER). Based on our lessons learned, we are employing a multi-label approach so that multiple labels can be assigned to a single image. MERNet will classify all Pancam images using classes of both science and engineering interest that were identified in a survey conducted by the MER Data Catalog project (Cole et al. 2020). MERNet will provide users with the first content-based search capability for MER images.

Finally, we plan to incorporate label shift adaptation (Alexandari, Kundaje, and Shrikumar 2020) into future upgrades of the Mars image classifiers.
It is evident that the data collected by Mars instruments are not i.i.d.; class frequencies change as spacecraft study different locations on Mars, globally from orbit or locally via rover traverse. By enabling the classifiers to adapt to class probability changes, we expect to obtain more reliable classifications.
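As a rough illustration of the idea (a simplified sketch in the spirit of, but not identical to, the maximum-likelihood procedure of Alexandari, Kundaje, and Shrikumar 2020), the following code re-estimates target-domain class priors by expectation maximization from calibrated source-domain posteriors, then reweights those posteriors accordingly. All arrays here are hypothetical.

```python
import numpy as np

def adapt_priors(posteriors, source_priors, n_iter=100):
    """EM estimate of target-domain class priors q(y), given calibrated
    source-domain posteriors p_s(y|x) for a batch of target images
    (shape: n_images x n_classes) and the training-set priors p_s(y)."""
    posteriors = np.asarray(posteriors, dtype=float)
    p_s = np.asarray(source_priors, dtype=float)
    q = p_s.copy()  # start from the training priors
    for _ in range(n_iter):
        w = posteriors * (q / p_s)           # reweight by the prior ratio
        w /= w.sum(axis=1, keepdims=True)    # renormalize each image's posterior
        q = w.mean(axis=0)                   # updated prior estimate
    return q

def adapted_posteriors(posteriors, source_priors, target_priors):
    """Adjust calibrated posteriors toward the estimated target priors."""
    w = np.asarray(posteriors) * (np.asarray(target_priors) / np.asarray(source_priors))
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical: posteriors for four archive images over two classes,
# from a classifier trained with balanced (0.5, 0.5) priors
archive = np.array([[0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.6, 0.4]])
q = adapt_priors(archive, [0.5, 0.5])
adjusted = adapted_posteriors(archive, [0.5, 0.5], q)
```

When the archive's class frequencies drift away from the training distribution (as with "Crater" above), this adjustment shifts the posteriors, and hence which predictions clear the 0.9 display threshold, toward the frequencies actually observed in deployment.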
Acknowledgments
We thank Michael McAuley from the PDS Imaging Node for the continuing support of this work and Anil Natha for assistance with the Google Analytics results. We also thank the numerous volunteers who helped label the Mars images. Part of this research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. This publication uses data generated via the Zooniverse.org platform, development of which is funded by generous support, including a Global Impact Award from Google, and by a grant from the Alfred P. Sloan Foundation.
References
Alexandari, A.; Kundaje, A.; and Shrikumar, A. 2020. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In Proceedings of the 2020 International Conference on Machine Learning, 222–232.

Aye, K.-M.; Schwamb, M. E.; Portyankina, G.; Hansen, C. J.; McMaster, A.; Miller, G. R. M.; Carstensen, B.; Snyder, C.; Parrish, M.; Lynn, S.; Mai, C.; Miller, D.; Simpson, R. J.; and Smith, A. M. 2019. Planet Four: Probing springtime winds on Mars by mapping the southern polar CO2 jet deposits. Icarus.

Cole, S. B.; et al. 2020. In Proceedings of the 51st Lunar and Planetary Science Conference, Abstract.

Journal of the Royal Statistical Society: Series D (The Statistician).

Space Science Reviews.

Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, 1321–1330.

Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 1097–1105.

Lipton, Z. C.; Wang, Y.-X.; and Smola, A. 2018. Detecting and correcting for label shift with black box predictors. In Proceedings of the 2018 International Conference on Machine Learning, 3128–3136.

Lu, S.; and Wagstaff, K. L. 2020. MSL Curiosity rover images with science and engineering classes, version 2.1.0. URL http://doi.org/10.5281/zenodo.4033453.

Lu, S.; Wagstaff, K. L.; Cai, J.; Doran, G.; Grimes, K.; Lee, J.; Mandrake, L.; and Yue, Y. 2019. Improved content-based image classifiers for the PDS Image Atlas. In .

McEwen, A. S.; Eliason, E. M.; Bergstrom, J. W.; Bridges, N. T.; Hansen, C. J.; Delamere, W. A.; Grant, J. A.; Gulick, V. A.; Herkenhoff, K. E.; Keszthelyi, L.; Kirk, R. L.; Mellon, M. T.; Squyres, S. W.; Thomas, N.; and Weitz, C. M. 2007. Mars Reconnaissance Orbiter's High Resolution Imaging Science Experiment (HiRISE). Journal of Geophysical Research (Planets).

Niculescu-Mizil, A.; and Caruana, R. 2005. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, 625–632.

Palafox, L. F.; Hamilton, C. W.; Scheidt, S. P.; and Alvarez, A. M. 2017. Automated detection of geological landforms on Mars using Convolutional Neural Networks. Computers & Geosciences.

Platt, J. C. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers.

Planetary and Space Science.

Proceedings of the AIAA SPACE Forum.

Rubner, Y.; Tomasi, C.; and Guibas, L. J. 1998. A metric for distributions with applications to image databases. In Proceedings of the Sixth International Conference on Computer Vision, 59–66. doi:10.1109/ICCV.1998.710701.

Wagstaff, K. L.; Lanza, N. L.; Thompson, D. R.; Dietterich, T. G.; and Gilmore, M. S. 2013. Guiding scientific discovery with explanations using DEMUD. In Proceedings of the Twenty-Seventh Conference on Artificial Intelligence, 905–911.

Wagstaff, K. L.; and Lu, S. 2020. Efficient active learning for new domains. In Proceedings of the 2020 ICML Workshop on Real World Experiment Design and Active Learning.

Wagstaff, K. L.; Lu, Y.; Stanboli, A.; Grimes, K.; Gowda, T.; and Padams, J. 2018. Deep Mars: CNN classification of Mars imagery for the PDS Imaging Atlas. In Proceedings of the Thirtieth Annual Conference on Innovative Applications of Artificial Intelligence, 7867–7872.

Wagstaff, K. L.; Panetta, J.; Ansar, A.; Greeley, R.; Hoffer, M. P.; Bunte, M.; and Schorghofer, N. 2012. Dynamic landmarking for surface feature identification and change detection.