Mars Image Content Classification: Three Years of NASA Deployment and Recent Advances
Kiri Wagstaff, Steven Lu, Emily Dunkel, Kevin Grimes, Brandon Zhao, Jesse Cai, Shoshanna B. Cole, Gary Doran, Raymond Francis, Jake Lee, Lukas Mandrake
Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA
Duke University, 2138 Campus Drive, Durham, NC 27708, USA
California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
Space Science Institute, 4765 Walnut St, Suite B, Boulder, CO 80301, USA
{kiri.l.wagstaff, you.lu, emily.dunkel, kevin.m.grimes}@jpl.nasa.gov, [email protected], [email protected], [email protected], {gary.b.doran.jr, raymond.francis, jake.h.lee, lukas.mandrake}@jpl.nasa.gov

Abstract
The NASA Planetary Data System hosts millions of images acquired from the planet Mars. To help users quickly find images of interest, we have developed and deployed content-based classification and search capabilities for Mars orbital and surface images. The deployed systems are publicly accessible through the PDS Image Atlas. We describe the process of training, evaluating, calibrating, and deploying updates to two CNN classifiers for images collected by Mars missions. We also report on three years of deployment, including usage statistics, lessons learned, and plans for the future.
Introduction
The NASA Planetary Data System (PDS) maintains archives of data collected by NASA missions that explore our solar system. The PDS Cartography and Imaging Sciences Node (Imaging Node) provides access to millions of images of planets, moons, comets, and other bodies. Given the large and continually growing volume of data, there is a need for tools that enable users to quickly search for images of interest. Each image product is described by a rich set of searchable metadata properties, such as the time it was collected, the instrument used, the image target, and the local season. However, users often wish to search on the content of the image to zero in on those images most relevant to a scientific investigation or individual curiosity. Manually searching through millions of images is infeasible. In previous work, we trained image classifiers to detect classes of interest in Mars orbital and surface images (Wagstaff et al. 2018). Using the predictions made by these classifiers, users can interactively search for classes of interest using the PDS Image Atlas (https://pds-imaging.jpl.nasa.gov/search/). Since the deployment of these classifiers in late 2016 and through August 2020, their predictions have been used to satisfy user searches on the Atlas website.

In this paper, we report on several new advances within this domain. First, we expanded the set of classes known to each classifier to broaden their coverage of different content types. Second, we employed classifier calibration to produce more reliable posterior probabilities, which is vital since only classifications with a posterior probability of at least 0.9 are displayed to users. Finally, we report on three years of deployment, including usage statistics, lessons learned, and plans for the future.

Related Work
Machine learning image classification has achieved high levels of performance since the adoption of convolutional neural networks (CNNs) trained on millions of images (Krizhevsky, Sutskever, and Hinton 2012). In addition to demonstrated improvements in accuracy, the use of a CNN removes the need for manual feature engineering. The ability to adapt or "fine-tune" large networks enables the re-use of the learned lower levels of the network on new image collections while customizing the output nodes to the classes of interest. Palafox et al. (2017) showed that a CNN out-performed a support vector machine classifier at finding Mars landforms of interest. In previous work, we demonstrated the ability to fine-tune the AlexNet classifier for application to images collected by instruments in Mars orbit and on the Mars surface (Wagstaff et al. 2018). Other approaches with relevance for planetary exploration are terrain classification of regions within an image to inform navigation (Rothrock et al. 2016) and generating text captions for planetary images to enable a larger search vocabulary (Qiu et al. 2020), as opposed to a fixed set of image classes.
New Mars Classifier Data Sets
We created two new labeled data sets to train and evaluate the latest versions of our Mars image classifiers. The HiRISE images were collected by the High Resolution Imaging Science Experiment (HiRISE) instrument onboard the Mars Reconnaissance Orbiter (MRO) (McEwen et al. 2007), while the MSL images were collected by the Mast Camera (Mastcam) and Mars Hand Lens Imager (MAHLI) instruments mounted on the Mars Science Laboratory (MSL) Curiosity rover (Grotzinger et al. 2012). To ensure high quality, the labels for both data sets were acquired using crowdsourcing with local volunteers who received specific training for each data set.

Class Name      Count    Percent
Bright dune       250      2.31%
Crater            794      7.34%
Dark dune         166      1.53%
Impact ejecta      74      0.68%
Other           8,802     81.39%
Slope streak      267      2.47%
Spider            164      1.52%
Swiss cheese      298      2.76%
Total          10,815       100%

Table 1: HiRISE (Mars orbital) data set class distribution.
HiRISE Orbital Data Set (v3)
In previous work (Wagstaff et al. 2018), we compiled images of Mars surface features that covered five classes of interest. The new HiRISE data set (v3) increases the number of labeled images to 10,815 (before augmentation), with eight classes (Doran et al. 2020).

HiRISE images consist of long, multi-kilometer strips imaged at a resolution of tens of centimeters per pixel. To identify surface features of interest, we employed a focus of attention mechanism known as dynamic landmarking (Wagstaff et al. 2012). This process scans through a large image to identify visually salient regions, which are termed landmarks. The salience of each pixel is defined as a linear combination of the response to a Canny filter and the Earth mover's distance (Rubner, Tomasi, and Guibas 1998) between the distribution of pixel intensity values within a window around the pixel and the values within a larger enclosing window. We employed a genetic algorithm to optimize the parameters (analysis window sizes, weighting of individual filters, and salience threshold) based on fourteen HiRISE images with hand-labeled salient regions. The salient landmarks within an image were obtained by identifying connected components of regions that exceed the salience threshold. We cropped the salient landmarks from the "browse" (reduced resolution) version of each HiRISE image using a square bounding box plus a small pixel border, then resized each crop to a fixed square size.

The resulting HiRISE image data set contains 10,815 landmark images derived from separate HiRISE source images. The class distribution is shown in Table 1 in alphabetical order. The classes are highly imbalanced: the majority of images are classified as "Other", and "Impact ejecta" is the least common class.

Examples of each class are shown in Figure 1. The "Bright dune", "Crater", "Dark dune", "Other", and "Slope streak" classes were included in the v1 HiRISE data set. "Bright dune" and "Dark dune" are two sand dune classes found on Mars.
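The salience measure used by dynamic landmarking can be sketched as follows. This is a toy illustration, not the deployed code: a plain gradient magnitude stands in for the Canny response, and the window sizes and weights (which the paper tunes with a genetic algorithm) are arbitrary placeholders here.

```python
import numpy as np

def emd_1d(p, q):
    """Earth mover's distance between two 1-D histograms with equal bins
    (sum of absolute differences of the cumulative distributions)."""
    p = p / p.sum()
    q = q / q.sum()
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def salience_map(img, inner=8, outer=24, w_grad=0.5, w_emd=0.5):
    """Toy per-pixel salience: a weighted sum of normalized gradient
    magnitude (stand-in for the Canny response) and the EMD between the
    intensity histogram of a small window and its enclosing window."""
    gy, gx = np.gradient(img.astype(float))
    grad = np.hypot(gx, gy)
    grad /= grad.max() + 1e-9
    sal = np.zeros_like(grad)
    h, w = img.shape
    for i in range(outer, h - outer):
        for j in range(outer, w - outer):
            win = img[i - inner:i + inner, j - inner:j + inner]
            ctx = img[i - outer:i + outer, j - outer:j + outer]
            hp, _ = np.histogram(win, bins=16, range=(0, 255))
            hq, _ = np.histogram(ctx, bins=16, range=(0, 255))
            # +1 smoothing avoids division by zero on empty histograms
            sal[i, j] = w_grad * grad[i, j] + w_emd * emd_1d(hp + 1.0, hq + 1.0)
    return sal
```

Landmarks would then be the connected components of pixels exceeding a salience threshold.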
Dark dunes are completely defrosted, whereas bright dunes are not. The "Crater" class consists of crater images in which the diameter of the crater is greater than or equal to 1/5 the width of the image and the circular rim is visible for at least half the crater's circumference. The "Slope streak" class consists of images of dark flow-like features on slopes. These features are believed to be formed by a dry process in which overlying (bright) dust slides down a slope and reveals a darker sub-surface. "Other" is a catch-all class that contains images that fit none of the defined classes of interest (e.g., Figure 1(e)). The data set is archived at https://doi.org/10.5281/zenodo.4002935.

Figure 1: Examples of each class in the HiRISE v3 data set: (a) Bright dune, (b) Crater, (c) Dark dune, (d) Impact ejecta, (e) Other, (f) Slope streak, (g) Spider, (h) Swiss cheese.

We introduce three new classes of interest: "Impact ejecta", "Spider", and "Swiss cheese". "Impact ejecta" refers to evidence of a meteorite impact on the surface. "Spiders" and "Swiss cheese" are phenomena that occur in the south polar region of Mars. Spiders have a central pit with radial troughs, and they are believed to form as a result of seasonal jets expelling carbon dioxide gas through an overlying ice layer (Aye et al. 2019). "Swiss cheese" is terrain that consists of pits formed when the sun heats the ice, making it sublimate (change from solid to gas).

We used a combination of labeling platforms to label the HiRISE landmark images. Early images were labeled by in-house volunteers using the Zooniverse.org platform. We conducted a second labeling campaign that targeted three minority classes: Impact ejecta, Spiders, and Swiss cheese. Landmark images from this campaign were labeled using the Interactive Data Analyzer and Reviewer (IDAR) browser-based image labeling tool (https://github.com/stevenlujpl/IDAR). We obtained labels for each image from three volunteers. Images for which the three labels did not agree (for the second campaign, approximately 30% of the images) were manually reviewed to select the final label. To guide labeling when more than one class was present in the image, we instructed volunteers to prioritize classes in the order Impact ejecta, Slope streak, Spider, Dark dune, Bright dune, Swiss cheese, Crater, Other.

MSL Surface Data Set (v2)
We created a new data set of Mars surface images collected by the Mastcam and MAHLI instruments on the MSL Curiosity rover. Mastcam is a two-instrument suite with left- and right-eye cameras. MAHLI is a single focusable camera located on the turret at the end of the rover's robotic arm.

Figure 2: Examples of each class in the MSL surface data set: (a) Arm cover, (b) Artifact, (c) Close-up rock, (d) Distant landscape, (e) Drill hole, (f) DRT, (g) DRT spot, (h) Float rock, (i) Layered rock, (j) Light-toned veins, (k) Mastcam calibration target, (l) Nearby surface, (m) Night sky, (n) Other rover part, (o) Sand, (p) Sun, (q) Wheel, (r) Wheel joint, (s) Wheel tracks.

In our previous work (Wagstaff et al. 2018), we created a data set whose classes primarily focused on rover hardware parts. The new data set (v2) includes images spanning 19 classes that primarily focus on objects of scientific interest (Lu and Wagstaff 2020). The MSL data set consists of RGB and grayscale images that are 8-bit, decompressed, radiometrically calibrated, color corrected, and geometrically linearized browse images from the MSL mission archive hosted by the PDS Imaging Node. We resized the smaller side of each image to a fixed length while preserving the aspect ratio, then center-cropped the longer side to produce square images. We randomly sampled images from sol (MSL mission day) 1 to 2224, composed of Mastcam left-eye camera, Mastcam right-eye camera, and
MAHLI images. The data set is archived at https://doi.org/10.5281/zenodo.4033453.

We performed class discovery by labeling images covering the full sol range with the browser-based Class Discovery Tool from the IDAR software suite. This tool allows users to interactively associate images with dynamically created categories as they are discovered. We started with an initial list of classes from a domain expert on the MSL science team and pre-sorted the images using the DEMUD algorithm (Wagstaff et al. 2013) so that the most "interesting" or unusual images were displayed first. The DEMUD algorithm is efficient in terms of class discovery (Wagstaff and Lu 2020). The class discovery process yielded 19 classes of interest.

Examples of each class are shown in Figure 2 in alphabetical order. They include three classes that describe rover-created features ("Drill hole", "DRT (Dust Removal Tool) spot", and "Wheel tracks"), "Sun" and "Night sky", seven Mars surface feature classes (e.g., "Light-toned veins", "Layered rock", "Float rock"), five rover part classes (e.g., "DRT", "Mastcam calibration target", "Wheel", and a generic "Other rover part" class), and "Artifact", used for images that are low in quality or contain missing data. The pixel resolutions and lighting conditions in these images vary widely because the targets were imaged at different distances and different times.

We used the IDAR labeling tool to label the MSL surface data set images. We divided the images into batches, and each batch was distributed to three volunteers for labeling. We provided detailed labeling instructions with definitions of each class and prioritization guidance for cases in which multiple classes appeared in a single image. As with the HiRISE data set, we resolved disagreements using expert review. The MSL surface data set class distribution is shown in Table 2.

CNN Classification for Mars Images
We trained and deployed two Convolutional Neural Network (CNN) classifiers, denoted HiRISENet and MSLNet, for MRO HiRISE images and for MSL Mastcam and MAHLI images. We employed transfer learning to adapt the weights of networks pre-trained on Earth images for use with Mars orbital and surface images.
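The fine-tuning recipe used by both classifiers (nearly frozen early layers, a fast-adapting replacement output layer) amounts to per-layer learning-rate multipliers in the SGD update. A minimal numpy sketch of that update (the function name, toy shapes, and multiplier values here are illustrative, not the deployed Caffe solver):

```python
import numpy as np

def sgd_step(params, grads, base_lr=1e-4, multipliers=None):
    """Toy SGD update with per-layer learning-rate multipliers: early
    layers get small multipliers (nearly frozen) while the new final
    layer receives a large multiplier and adapts quickly."""
    if multipliers is None:
        multipliers = [1.0] * len(params)
    return [p - base_lr * m * g
            for p, m, g in zip(params, multipliers, grads)]
```

With multipliers like [1, 1, 10], the final layer moves ten times farther per step than the early layers for the same gradient magnitude.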
HiRISENet: CNN Classifier for Mars Orbital Images
We adapted the AlexNet image classifier (Krizhevsky, Sutskever, and Hinton 2012) for use with the HiRISE classes. AlexNet was trained on 1.2 million Earth images from 1000 classes in the ImageNet data set. We started with Caffe's BVLC reference model (Jia et al. 2014), a replication of AlexNet provided by Jeff Donahue (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet). We removed the final fully connected layer, added a new layer with eight output classes, and re-trained the network with Caffe (Jia et al. 2014). We followed Caffe's recommendations for fine-tuning, including using a small base learning rate, a small step size, and a larger learning rate multiplier for the final layer only. We used a learning rate of 0.0001, a weight decay of 0.0005, and a relatively small step size. The initial layers were almost fixed; they used learning rate multipliers of 1 (weight) and 2 (bias). The final layer was allowed greater adaptation, with multipliers of 10 (weight) and 20 (bias). Caffe computes the per-band mean pixel values from the training set and uses these values to normalize all images during training and prediction.

We split the HiRISE data set into train, validation, and test sets using each landmark's HiRISE source image identifier to ensure no overlap in source images between the sets. We used approximately 65% of the data for training, 19% for validation, and 17% for testing. Images obtained from our second labeling campaign (which targeted minority classes) appear in the training and validation sets only, so that we could assess improvements against an unchanged test set.

We applied data augmentation to the training and validation sets. The augmentation includes three rotations (90, 180, and 270 degrees), horizontal and vertical flips, and a random brightness adjustment. In addition, we up-sampled data obtained in the second labeling campaign by a factor of two.

MSLNet: CNN Classifier for Mars Surface Images
MSLNet is a hybrid of two CNN classifiers. The version 1 (v1) classifier focused on rover hardware classes (Wagstaff et al. 2018; Lu et al. 2019). The primary objectives of the Mastcam and MAHLI instruments are to enable science analysis of rover investigation sites, which motivated the creation of the version 2 (v2) classifier to expand the set of classes to include science targets (e.g., "Float rock", "Layered rock") and activities (e.g., "DRT spot", "Drill hole"). The v1 classifier initially focused on engineering considerations and rover hardware classes due to requests by the MSL rover planning team as well as the pre-existing availability of labels for those items. The "Wheel" class was of particular interest due to growing awareness in 2017 that the rover's wheels were experiencing a higher than expected level of degradation from driving on the rough surface. The success of the v1 classifier led to new requests to also accommodate science-related classes in support of the MSL mission to explore and understand Mars. Observations that contain classes such as "Layered rock" and "Light-toned veins" are very high science priorities to help determine the history and evolution of water activity, which can also have implications for habitability. The deployed MSLNet classifier covers both areas of interest (engineering and science) to meet the needs of diverse users with different priorities.

MSLNet first classifies images with the v2 classifier, then reclassifies any images classified as "Other rover part" with the v1 classifier to obtain a fine-grained classification of rover parts. The creation and evaluation of the v1 classifier were reported in previous work (Wagstaff et al. 2018). The v2 classifier was trained and evaluated using the MSL surface data set described above.
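The two-stage cascade described above can be sketched in a few lines; here `v1_predict` and `v2_predict` are hypothetical stand-ins for the two trained models, each returning a (label, confidence) pair:

```python
def mslnet_predict(image, v2_predict, v1_predict):
    """Hybrid MSLNet sketch: classify with the science-focused v2 model
    first, then refine any generic 'Other rover part' detection with the
    hardware-focused v1 model."""
    label, conf = v2_predict(image)
    if label == "Other rover part":
        label, conf = v1_predict(image)  # fine-grained rover-part class
    return label, conf
```

Because the v1 model is only consulted for images the v2 model assigns to its generic rover-part class, science predictions pass through untouched.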
We divided this data set into training, validation, and test sets according to sol of acquisition, to enable the evaluation of generalization performance on newly acquired images. The sol splitting boundaries, as shown in Table 3, were chosen so that the per-camera distributions roughly match the full archive.

To improve the generalization performance of the classifier, we augmented the images in the training data set (but not the validation and test data sets). The MAHLI images (which come from a rotatable platform) were augmented using rotation (90°, 180°, and 270°) and flipping (horizontal and vertical); the Mastcam images (which come from a fixed platform) were augmented using only horizontal and vertical flipping.

As with HiRISENet, for the MSLNet v2 classifier we fine-tuned AlexNet with a fixed base learning rate. We set the learning rate multipliers of the first four convolution layers, the fifth convolution layer, and the final fully connected layers to 0, 1, and 20, respectively. We used dropout for the first and second fully connected layers. The final hybrid MSLNet classifier combines v1 and v2 and classifies images into classes of both science and engineering relevance.

Classifier Calibration
The deployed classifiers use a confidence threshold to determine which results are shown to users, so it is vital that the models are well calibrated. Modern neural networks have achieved higher accuracies but in many cases have suffered an increase in calibration error, which means that the predicted class probabilities deviate from the true empirical probabilities. In many cases, the networks are consistently over-confident in their predictions. This effect appears to be tied to an increase in model capacity and a lack of regularization (Guo et al. 2017). For a quantitative measure of model calibration, we calculate the Expected Calibration Error (ECE), which is the expected difference between posterior probability (confidence) and accuracy. We partition n predictions into M equally spaced bins and compute a population-weighted average of the difference between accuracy and confidence within each bin:

ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.

Table 3: MSL data set sol ranges: train, sol 1-948; validation, sol 949-1920; test, sol 1921-2224; full archive, sol 1-2224.

Table 4: Classification accuracy on HiRISE (Mars orbital) images: train (n = 6997; 51,058 after augmentation), validation (n = 2025; 14,959 after augmentation), test (n = 1793). The most-common-class baseline achieves 78.4% (train), 75.0% (validation), and 81.1% (test).

We evaluated four post-hoc calibration methods that extend Platt scaling (Platt et al. 1999) to multiclass problems. Temperature scaling (Guo et al. 2017) uses a single parameter T to rescale the model output. Given the model output for item x, which is a logit vector z \in \mathbb{R}^K, the calibrated probability of class k is

p(y = k \mid x) = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}.

The parameter T is optimized with respect to the log likelihood on the validation set. Since T does not change the maximum of the softmax function, the accuracy of the model is unchanged. Bias-corrected temperature scaling (BCTS) (Alexandari, Kundaje, and Shrikumar 2020) adds a bias term b_k for each class:

p(y = k \mid x) = \frac{e^{z_k / T + b_k}}{\sum_{j=1}^{K} e^{z_j / T + b_j}}.

Vector and matrix scaling (Guo et al. 2017) add additional flexibility with per-class scaling, using a K x K linear transformation matrix W to compute Wz + b and then normalizing across classes to get p(y = k | x). Vector scaling constrains W to be a diagonal matrix whose entries function as class-specific temperature values.

CNN Classification Evaluation
To evaluate HiRISENet and MSLNet, we used the overall (post-threshold) accuracy score and abstention rate as the primary performance metrics, and ECE to measure the calibration level of the classifiers. We also analyzed the precision and recall scores to understand per-class performance.
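These metrics can be computed directly from per-image confidences and correctness indicators. A minimal numpy sketch (the function names and the 10-bin choice are ours; a temperature-scaled softmax is included to show how the calibration step above fits in):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax: higher T flattens the probabilities
    without changing the argmax (so accuracy is unchanged)."""
    logits = np.asarray(z, dtype=float) / T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error: population-weighted gap between
    per-bin accuracy and per-bin mean confidence."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = len(conf)
    err = 0.0
    for m in range(n_bins):
        mask = bins == m
        if mask.any():
            err += mask.sum() / total * abs(correct[mask].mean() - conf[mask].mean())
    return err

def thresholded_accuracy(conf, correct, tau=0.9):
    """Post-threshold accuracy over predictions with confidence >= tau,
    plus the abstention rate (fraction withheld from users)."""
    keep = conf >= tau
    acc = correct[keep].mean() if keep.any() else float("nan")
    return acc, 1.0 - keep.mean()
```

Raising the threshold tau trades coverage (more abstention) for reliability of the classifications that are shown.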
HiRISENet Evaluation
HiRISENet classification accuracy results are shown in Table 4. Random class prediction on this data set achieves 11.1% accuracy (given eight classes). Compared to a simple baseline that predicts the most common class from the training set ("Other"), HiRISENet exhibits a strong improvement from 81.1% to 92.8% on the test set.

Method               ECE     Acc    Acc (0.9 confidence)   Abst. Rate
Uncalibrated         0.073   88.6   94.2
Temperature scaling  0.013   88.6   97.3                   31%
BCTS                 0.014   89.2   97.3                   29%

Matrix scaling achieved the highest post-threshold accuracy as well as the lowest abstention rate. Vector scaling achieved the lowest ECE, but with higher abstention and lower accuracy. Therefore, we adopted matrix scaling for deployment. On the test set, the calibrated HiRISENet model achieved high accuracy at the 0.9 confidence threshold with a modest abstention rate.

Reliability diagrams (DeGroot and Fienberg 1983; Niculescu-Mizil and Caruana 2005) provide a visual representation of model calibration. The empirical per-bin accuracy is plotted as a function of model posterior probability; for a perfectly calibrated model, these values are equal, following the diagonal line. Figure 3 shows the reliability diagram for HiRISENet. The bottom panel shows the data set distribution in terms of predicted probability. We find that HiRISENet is well calibrated, with a low ECE and the majority of predictions in the most-confident bin.

Figure 4: Calibrated HiRISENet per-class precision and recall (test set).

Figure 4 shows per-class precision and recall on the test set after matrix scaling calibration. Most classes achieve high precision and recall, even after thresholding. The "Spider" class has the lowest recall (evaluated on only 42 test items), while the "Impact ejecta" class has the lowest precision. Figure 5 shows the confusion matrix on the test set after matrix scaling calibration. Diagonal (correct) entries have a dark background. A comparison to the confusion matrix before calibration (not shown) indicates that two images that were previously incorrectly classified into the "Spider" class are now correctly classified as the "Impact ejecta" class; however, the confusion between the "Crater" and "Impact ejecta" classes has increased. In addition, the "Spider" class suffered from significant domain shift, which is evident in Figure 6.
The "Spider" images in the validation set, shown in Figure 6(a), are extremely visually different from the "Spider" images in the test set, shown in Figure 6(b). We found that even human labelers had trouble recognizing them as the same phenomenon. Future updates to this data set will target the "Spider" class.

MSLNet Evaluation
The performance of the MSLNet v2 classifier is shown in Table 6 in comparison to the most-common-class ("Nearby surface") baseline. The MSLNet classifier significantly outperforms the baseline method on the test set, both overall and when abstaining below the 0.9 confidence threshold. Recall that images in the training, validation, and test sets were divided according to their sol (date) of acquisition.

Figure 5: Calibrated HiRISENet confusion matrix (test set). Test-set class counts: other (1482), crater (89), dark dune (66), slope streak (49), bright dune (16), impact ejecta (7), swiss cheese (42), spider (42).

Figure 6: Domain shift in HiRISE "Spider" landmarks: (a) validation set, (b) test set.

Table 6: Classification accuracy on MSL (surface) images: train (n = 5920), validation (n = 300), test (n = 600). The most-common-class baseline achieves 26.3% (train), 24.7% (validation), and 31.2% (test).

The performance of the classifier gradually decreases over time as the rover traverses to new locations, possibly due to label shift, in which the prior class probabilities change over space or time (Lipton, Wang, and Smola 2018). We plan to investigate label shift adaptation to enable the classifier to accommodate such change.

MSLNet achieves lower accuracy and higher abstention than HiRISENet on its corresponding test set. Given the larger number of classes and smaller number of labeled images, we believe that this classifier is likely even more data-limited and would benefit from additional data collection.

Reliable posterior probabilities are likewise essential for MSLNet. We calibrated the MSLNet classifier using temperature scaling, the most computationally efficient method among the four calibration methods discussed in this paper (e.g., matrix scaling scales quadratically with the number of classes, which is problematic for MSLNet). After calibration, test set accuracy using the confidence threshold improved, at the cost of increased abstention. For this application, we are willing to sacrifice coverage to ensure that the classifications provided to users are highly reliable. MSLNet's ECE also improved with temperature scaling. The reliability diagram of MSLNet after calibration is shown in Figure 7.

Figure 8 shows per-class precision and recall on the test set. The first group includes classes whose F1 scores exceed roughly 0.6; the second (intermediate) group includes five classes (e.g., "Layered rock", "Drill hole") whose F1 scores fall between roughly 0.2 and 0.6; and the third group includes four classes (e.g., "Float rock", "Wheel tracks") whose F1 scores are below roughly 0.2. We note that the classes in the third group were evaluated on very few images, so their performance is not statistically robust. These classes require further improvement, and we plan to investigate up-sampling or additional data acquisition to increase the number of images of these classes.

PDS Image Atlas Deployment
HiRISENet and MSLNet generate classifications that enable PDS users to quickly find images of interest via content-based search. The public interface to the PDS image archives is the PDS Image Atlas (https://pds-imaging.jpl.nasa.gov/search/).

Figure 8: Calibrated MSLNet per-class precision and recall (test set).

Figure 9: View from the Atlas of craters found in HiRISE image PSP_002912_2075_RED.

Figure 10: Number of monthly queries for HiRISENet and MSLNet classifications over 3.5 years (colors distinguish years): (a) HiRISENet queries, (b) MSLNet queries.

Users can select an instrument (e.g., HiRISE) and filter the images to only contain a particular class of interest. To enable this kind of search, we applied HiRISENet and MSLNet to the full archive of images collected by the relevant instruments on Mars. These archives contain far more images than the labeled subsets used for training and evaluation. Figure 9 shows the Atlas user view of a HiRISE image with all confidently classified craters enclosed in red bounding boxes. Craters that are small, faint, degraded, or distorted are less likely to pass the confidence threshold, but those identified as craters are highly reliable. In response to user requests, we added the ability to download a file that contains the latitude and longitude coordinates of each detected landmark, using PDSC (https://github.com/JPLMLIA/pdsc) to convert from pixel to geographic coordinates.

Classifying all HiRISE images took about five days on a GPU system and yielded landmarks with a posterior probability of at least 0.9 from classes other than "Other". We also filtered out predictions for "Spider" or "Swiss cheese" at latitudes outside of the south polar region (Aye et al. 2019). This total represents an 81% increase over the number of classified landmarks available in the first classifier release (Wagstaff et al. 2018). MSLNet classified images with a posterior probability of at least 0.9. Both classifiers are integrated into the data ingestion pipeline for the Atlas. As new data is delivered from HiRISE or MSL, the images are automatically processed and tagged with their predicted classes.

We track the number of Atlas queries that make use of HiRISENet and MSLNet classifications. As shown in Figure 10(a), HiRISENet exhibits regular and increasing usage over time. The most popular HiRISE class to be queried is "Crater". MSLNet shows more varied activity (Figure 10(b)), dominated by heavy usage in early 2019, when the number of queries for "Wheel" increased by several orders of magnitude (note the difference in y-axis range). Given the small separation in query timestamps, this was most likely the result of a large number of scripted queries to the Atlas, which provides a public API. It is possible that this classifier output is serving to help train other investigators' models.

We also analyzed the top 20 domain names from which the queries came. From January to July of 2020, we found that 40% of these queries came from hosts through an ISP, including Spectrum and Comcast as well as ISPs in the U.K., the Netherlands, Romania, and Taiwan. Another 33% of these queries came from JPL domains, which is not surprising given the relevance of the content to JPL projects. A minority of the top 20 domain queries came from the Remote Sensing Technology Center of Japan (2%), the University of Wyoming (1%), and SoftBank (Japan) (1%), while 23% of hostnames did not resolve to a domain.

Finally, we used Google Analytics to examine the global distribution of visitors who made classification queries between July 2017 (the oldest data available) and August 2020. The top ten countries by share of visitors were: United States (51%), India (6%), United Kingdom (4%), Germany (3%), Canada (2%), France (2%), Italy (2%), Spain (2%), Australia (2%), and Russia (2%). In all, visitors came from 180 different countries.

Lessons learned.
The deployment of Mars image classi-fiers at scale has yielded several lessons learned. First, it isworth highlighting the challenge of defining meaningful andrelevant classes up front. Unlike a fixed benchmark data set,new Mars images are continually collected and new classescould arise at any time. Collaboration with domain expertsis vital for ensuring the relevance of the class definitions.Second, domain shift is an important factor in both datasets. Figure 11 compares the class distribution (excluding“Other”) for the labeled HiRISE data set (brown) to thepredictions made across the full HiRISE archive (orange). r i g h t d u n e c r a t e r d a r k d u n e i m p a c t e j e c t a s l o p e s t r e a k s p i d e r s w i s s c h e e s e F r a c t i o n o f i m a g e s LabeledDeployed
Figure 11: Class distribution for the HiRISE labeled data set versus predictions on the full archive.

While the "Crater" class is dominant in both, its share of the images classified nearly doubles when deployed. There are likely two factors involved: our labeled data set is not fully representative of all of Mars, and "Crater" predictions may in general be more confident and thus more likely to pass the 0.9 threshold and be retained here. Similar effects are seen in the MSL data set. We are currently investigating the use of label shift adaptation to increase the robustness of both classifiers.

Finally, we found that our initial simplifying assumption of one class per image is sometimes inadequate (a crater might contain a dark slope streak; an MSL image might contain rover parts and the horizon). Even with guidance on class priorities, human labelers sometimes found it difficult to select a single class label. We plan to adopt a multi-label approach in future versions to allow more flexibility and reduce label noise.
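The difference between the two settings can be sketched in a few lines: a multi-label head scores each class with an independent sigmoid rather than a single competing softmax, so one image can legitimately receive both "Wheel" and "Horizon" (or no label at all). The class names, logits, and 0.5 decision threshold below are illustrative, not the deployed configuration.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def multi_label_predict(logits, class_names, threshold=0.5):
    """Independent per-class decisions: an image may receive zero, one,
    or several labels, unlike argmax over a softmax."""
    probs = sigmoid(logits)
    return [name for name, p in zip(class_names, probs) if p >= threshold]

classes = ["crater", "dark slope streak", "wheel", "horizon"]
# Hypothetical logits for an MSL image showing both rover parts and the horizon
print(multi_label_predict([-3.1, -2.5, 2.2, 1.7], classes))  # ['wheel', 'horizon']
```

With a softmax head, the same image would be forced into exactly one of the two correct classes, which is the source of the label noise described above.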
Conclusions and Future Work
This paper presents the latest updates to the deployment of machine learning image classifiers in support of planetary science. Two classifiers, for orbital and surface images of Mars, have been operating since late 2016 to enable the first content-based search of large NASA image archives. Usage has increased over time, and feedback from users as well as internal assessments have guided recent improvements. These include acquiring additional training data to improve minority-class performance for the HiRISE classifier, defining new classes of scientific interest for the MSL classifier, employing calibration to increase classifier reliability, and making landmark coordinates downloadable. To increase our understanding of how and why users employ the machine learning classifications of Mars images, we are investigating minimally intrusive ways to learn more about user motivations and use cases. Meanwhile, the performance and scope of these classifiers continue to grow. Each time new images are delivered by the instruments at Mars, they are automatically classified and added to the archive.

We are currently developing a new classifier, MERNet, that will operate on images collected by the two Panoramic Camera (Pancam) instruments on the Opportunity and Spirit Mars Exploration Rovers (MER). Based on our lessons learned, we are employing a multi-label approach so that multiple labels can be assigned to a single image. MERNet will classify all Pancam images using classes of both science and engineering interest that were identified in a survey conducted by the MER Data Catalog project (Cole et al. 2020). MERNet will provide users with the first content-based search capability for MER images.

Finally, we plan to incorporate label shift adaptation (Alexandari, Kundaje, and Shrikumar 2020) into future upgrades of the Mars image classifiers.
It is evident that the data collected by Mars instruments are not i.i.d.; class frequencies change as spacecraft study different locations on Mars, globally from orbit or locally via rover traverse. By enabling the classifiers to adapt to class probability changes, we expect to obtain more reliable classifications.
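As a rough illustration of the idea (a simplified sketch in the spirit of, but not identical to, the maximum-likelihood procedure of Alexandari, Kundaje, and Shrikumar 2020), the following code re-estimates target-domain class priors by expectation maximization from calibrated source-domain posteriors, then reweights those posteriors accordingly. All arrays here are hypothetical.

```python
import numpy as np

def adapt_priors(posteriors, source_priors, n_iter=100):
    """EM estimate of target-domain class priors q(y), given calibrated
    source-domain posteriors p_s(y|x) for a batch of target images
    (shape: n_images x n_classes) and the training-set priors p_s(y)."""
    posteriors = np.asarray(posteriors, dtype=float)
    p_s = np.asarray(source_priors, dtype=float)
    q = p_s.copy()  # start from the training priors
    for _ in range(n_iter):
        w = posteriors * (q / p_s)           # reweight by the prior ratio
        w /= w.sum(axis=1, keepdims=True)    # renormalize each image's posterior
        q = w.mean(axis=0)                   # updated prior estimate
    return q

def adapted_posteriors(posteriors, source_priors, target_priors):
    """Adjust calibrated posteriors toward the estimated target priors."""
    w = np.asarray(posteriors) * (np.asarray(target_priors) / np.asarray(source_priors))
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical: posteriors for four archive images over two classes,
# from a classifier trained with balanced (0.5, 0.5) priors
archive = np.array([[0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.6, 0.4]])
q = adapt_priors(archive, [0.5, 0.5])
adjusted = adapted_posteriors(archive, [0.5, 0.5], q)
```

When the archive's class frequencies drift away from the training distribution (as with "Crater" above), this adjustment shifts the posteriors, and hence which predictions clear the 0.9 display threshold, toward the frequencies actually observed in deployment.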
Acknowledgments
We thank Michael McAuley from the PDS Imaging Node for the continuing support of this work and Anil Natha for assistance with the Google Analytics results. We also thank the numerous volunteers who helped label the Mars images. Part of this research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. This publication uses data generated via the Zooniverse.org platform, development of which is funded by generous support, including a Global Impact Award from Google, and by a grant from the Alfred P. Sloan Foundation.
References
Alexandari, A.; Kundaje, A.; and Shrikumar, A. 2020. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In Proceedings of the 2020 International Conference on Machine Learning, 222–232.

Aye, K.-M.; Schwamb, M. E.; Portyankina, G.; Hansen, C. J.; McMaster, A.; Miller, G. R. M.; Carstensen, B.; Snyder, C.; Parrish, M.; Lynn, S.; Mai, C.; Miller, D.; Simpson, R. J.; and Smith, A. M. 2019. Planet Four: Probing springtime winds on Mars by mapping the southern polar CO2 jet deposits. Icarus.

Cole, S. B.; et al. 2020. In Proceedings of the 51st Lunar and Planetary Science Conference, Abstract.

Journal of the Royal Statistical Society: Series D (The Statistician).

Space Science Reviews.

Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, 1321–1330.

Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 1097–1105.

Lipton, Z. C.; Wang, Y.-X.; and Smola, A. 2018. Detecting and correcting for label shift with black box predictors. In Proceedings of the 2018 International Conference on Machine Learning, 3128–3136.

Lu, S.; and Wagstaff, K. L. 2020. MSL Curiosity rover images with science and engineering classes, version 2.1.0. URL http://doi.org/10.5281/zenodo.4033453.

Lu, S.; Wagstaff, K. L.; Cai, J.; Doran, G.; Grimes, K.; Lee, J.; Mandrake, L.; and Yue, Y. 2019. Improved content-based image classifiers for the PDS Image Atlas. In .

McEwen, A. S.; Eliason, E. M.; Bergstrom, J. W.; Bridges, N. T.; Hansen, C. J.; Delamere, W. A.; Grant, J. A.; Gulick, V. A.; Herkenhoff, K. E.; Keszthelyi, L.; Kirk, R. L.; Mellon, M. T.; Squyres, S. W.; Thomas, N.; and Weitz, C. M. 2007. Mars Reconnaissance Orbiter's High Resolution Imaging Science Experiment (HiRISE). Journal of Geophysical Research (Planets).

Niculescu-Mizil, A.; and Caruana, R. 2005. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, 625–632.

Palafox, L. F.; Hamilton, C. W.; Scheidt, S. P.; and Alvarez, A. M. 2017. Automated detection of geological landforms on Mars using Convolutional Neural Networks. Computers & Geosciences.

Platt, J. C. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers.

Planetary and Space Science.

Proceedings of the AIAA SPACE Forum.

Rubner, Y.; Tomasi, C.; and Guibas, L. J. 1998. A metric for distributions with applications to image databases. In Proceedings of the Sixth International Conference on Computer Vision, 59–66. doi:10.1109/ICCV.1998.710701.

Wagstaff, K. L.; Lanza, N. L.; Thompson, D. R.; Dietterich, T. G.; and Gilmore, M. S. 2013. Guiding scientific discovery with explanations using DEMUD. In Proceedings of the Twenty-Seventh Conference on Artificial Intelligence, 905–911.

Wagstaff, K. L.; and Lu, S. 2020. Efficient active learning for new domains. In Proceedings of the 2020 ICML Workshop on Real World Experiment Design and Active Learning.

Wagstaff, K. L.; Lu, Y.; Stanboli, A.; Grimes, K.; Gowda, T.; and Padams, J. 2018. Deep Mars: CNN classification of Mars imagery for the PDS Imaging Atlas. In Proceedings of the Thirtieth Annual Conference on Innovative Applications of Artificial Intelligence, 7867–7872.

Wagstaff, K. L.; Panetta, J.; Ansar, A.; Greeley, R.; Hoffer, M. P.; Bunte, M.; and Schorghofer, N. 2012. Dynamic landmarking for surface feature identification and change detection.