An Active Learning Framework for Constructing High-fidelity Mobility Maps
Gary R. Marple∗, David Gorsich†, Paramsothy Jayakumar†, Shravan Veerapaneni∗
∗Department of Mathematics, University of Michigan
†U.S. Army CCDC Ground Vehicle Systems Center
Abstract—A mobility map, which provides maximum achievable speed on a given terrain, is essential for path planning of autonomous ground vehicles in off-road settings. While physics-based simulations play a central role in creating next-generation, high-fidelity mobility maps, they are cumbersome and expensive. For instance, a typical simulation can take weeks to run on a supercomputer and each map requires thousands of such simulations. Recent work at the U.S. Army CCDC Ground Vehicle Systems Center has shown that trained machine learning classifiers can greatly improve the efficiency of this process. However, deciding which simulations to run in order to train the classifier efficiently is still an open problem. According to PAC learning theory, data that can be separated by a classifier is expected to require O(1/ε) randomly selected points (simulations) to train the classifier with error less than ε. In this paper, building on existing algorithms, we introduce an active learning paradigm that substantially reduces the number of simulations needed to train a machine learning classifier without sacrificing accuracy. Experimental results suggest that our sampling algorithm can train a neural network, with higher accuracy, using less than half the number of simulations when compared to random sampling.
I. INTRODUCTION
Mobility is the essential requirement for all military ground vehicles, and the loss of mobility due to unfavorable terrain or soil conditions can jeopardize a mission's success and can leave troops stranded. To avoid this scenario, a planner must use a mobility map that gives the maximum predicted speed the vehicles would be expected to reach while traversing a treacherous off-road terrain. The NATO Reference Mobility Model (NRMM) is widely used for predicting the mobility of ground vehicles [1], [2]. However, newer vehicles containing advanced technologies have mobility capabilities that cannot easily be predicted using the NRMM method because of its empirical nature. As a result, the NATO Next Generation NRMM Team has identified physics-based modeling, mainly using the discrete element method (DEM), as being a potential high-fidelity method for predicting mobility [3].

While physics-based simulations can offer more accurate predictions of vehicle mobility, generating a mobility map that can accurately predict a mobility metric—such as the speed-made-good, defined as the ratio of the Euclidean distance and the time required to travel between two points, regardless of the actual path taken [1]—requires tens of thousands of simulations. Consequently, it can take weeks to generate such maps using high performance computing (HPC) architectures. Recent work by Mechergui and Jayakumar [4] showed that machine learning classifiers can be trained to quickly generate mobility maps. While this approach is promising, it comes with its own challenges.

The main obstacle with using supervised learning is the computational expense that is incurred when constructing the training set. A computationally-intensive DEM simulation must be performed to predict the speed-made-good to label a single data point. This becomes prohibitively expensive since performing each numerical simulation can take over a week on a 20-core compute node using state-of-the-art simulation software; see [5] for more details. Based on the usage cost of the HPC cluster utilized for running the numerical simulations in this paper (the Flux supercomputer at the University of Michigan [6]), generating even a small training set with 200 points costs over ten thousand dollars.
An additional issue is that simple sampling techniques, such as uniform random sampling, often choose uninformative data points that have little to no effect on the accuracy of the classifier. In other words, running simulations with parameter values that the classifier is already confident about only reinforces the model and does little to illuminate inaccuracies that can be improved upon. Finally, the focus of [4] was on training classifiers with 2-dimensional feature spaces only, when in actuality, the speed-made-good depends on many parameters, such as the terrain topology and profile, soil type (mud, snow, sand, etc.), vegetation, and weather conditions. This poses a significant challenge due to the curse of dimensionality and the computational limitations that prevent us from generating large training sets.

Of course, some of these challenges may be alleviated by more efficient DEM simulations, algorithms for which are an active area of research [7]–[10]. However, even a moderate reduction in simulation times would not eliminate the need for machine learning-based predictions. Once trained, a machine learning classifier, such as a multilayer perceptron (MLP), can generate mobility maps on-the-fly. This is critically important since changing weather conditions can quickly alter soil properties. In addition, uncertainty quantification techniques that account for imprecise measurements can be significantly hindered if the model evaluation time is long. Machine learning classifiers offer a faster and more economical means of addressing these problems. However, training these algorithms in a reasonable time with an affordable budget is still a challenge.
DISTRIBUTION A. Approved for public release; distribution unlimited. OPSEC

Fig. 1. (a) A DEM-based simulation of a vehicle traversing off-road conditions. The vehicle shown in this figure will be used for all experiments. (b) An example of a mobility map. The colors indicate the maximum sustained speed an off-road vehicle would be expected to reach.

To reduce the number of simulations, in this work, we developed an active learning-based approach that allows us to generate mobility maps using less data. Unlike supervised learning, where the entire training set is constructed a priori, active learning allows the classifier to interactively query an annotator about a pool of unlabeled data [11]. Once the queried instances have been labeled, they are added to the training set, the classifier is retrained, and a new set of instances is chosen. In many cases, this iterative approach can train a classifier with higher accuracy using fewer instances when compared to supervised learning. Active learning has been utilized in many areas, including natural language processing, drug discovery, text classification, image retrieval, medical image classification, and landslide prediction [12]–[17]. In general, an active learning-based approach is often advantageous when the application has a large amount of unlabeled data available, but labeling that data is expensive or time-consuming. This is precisely the situation that arises when training a classifier to generate mobility maps. In our case, the annotator is a computationally demanding DEM-based simulation, and the unlabeled pool that the active learner picks from consists of all possible combinations of soil parameters. Building on existing active learning algorithms, we propose a framework that is tailored to the needs of mobility map construction.

The paper is organized as follows. We will begin by giving a primer on active learning in Section II, where we review the query-by-bagging algorithm and discuss some of its advantages over uncertainty sampling. In Section III, we will focus on some of the main considerations that went into the development of our active learning approach tailored to mobility map generation. Section IV will focus on our experimental results, where we used physics-based simulations to train a neural network to predict the speed-made-good using 2- and 3-dimensional feature spaces. Finally, Section V will give our conclusions along with areas for future research.

II. PRELIMINARIES
In this section, we begin by introducing the notion of a version space and show how it can be used to identify informative instances. This will lead to a discussion about the query-by-committee and query-by-bagging algorithms and why query-by-bagging is well-suited for predicting simulation results.
A. Version Space
To begin with, assume that all instances have noise-free labels. To introduce the version space, we first need to define a hypothesis. A hypothesis h : X → Y is a function, generated by a machine learning algorithm, that maps instances x in the feature space X to labels y in the set of class labels Y. A hypothesis space H is the set of all hypotheses under consideration. For example, a single set of values for the weights in a MLP could be used to specify a single hypothesis, and the set of all possible sets of values for the weights in a MLP could be thought of as being representative of a hypothesis space. The set of all hypotheses h ∈ H that are consistent with the training set L is referred to as the version space V [18]. To be precise,

V = { h ∈ H | h(x) = y for all ⟨x, y⟩ ∈ L }.

Now suppose that V is bounded and that there exists a hypothesis c ∈ H that can correctly label any instance in X. This hypothesis would be consistent with any training set, so we can conclude that c ∈ V. We would like to query instances that allow us to focus in on c by reducing the size of V. If we choose to query an instance whose label happens to be consistent with all of the hypotheses in V, then V would remain unchanged after the instance was added to L. Clearly, this is something we would like to avoid. On the other hand, if we query an instance that has a label that is inconsistent with some of the hypotheses in V, then those hypotheses would be eliminated from V. Since the label for a point is unknown in advance, a good approach is to query instances where any label will result in the elimination of a large portion of V. In other words, we should try to query points that create the greatest amount of disagreement. For example, we could try to query instances where at least half of the hypotheses disagree, no matter the label. This would cut the size of V by at least a half each time we queried a point, giving us exponential convergence onto c. The main challenge with doing this is in determining when the majority of the hypotheses in V disagree. Fortunately, the query-by-committee algorithm gives us a way of approximating this.

B. Query-By-Committee
The query-by-committee (QBC) algorithm is a highly effective active learning technique that focuses on reducing the version space by forming a committee of hypotheses that serve as a representative sample of V [19]. This committee of hypotheses determines whether a point should be queried or not by following a voting process. If the majority of the hypotheses in the committee disagree on the label for a point, then querying that point will eliminate at least half of the committee members from V. Furthermore, if the committee is representative of V, then we would expect that V would be approximately halved as well. Under certain conditions, QBC has been shown to achieve prediction error ε with high probability using O(1/ε) unlabeled instances and O(log(1/ε)) queries [20], that is, an exponential improvement over random sampling.

While such a promising theoretical guarantee is encouraging, it turns out that QBC has few practical applications. Part of the reason for this is that noisy labels can lead to scenarios where V is empty. In other words, noisy labels can result in the elimination of hypotheses that would otherwise be consistent with the data. As a result, this can make it impossible to form a committee. An additional issue can also occur when V is nonempty. When using a deterministic classifier, such as a SVM, it can be difficult to find multiple hypotheses that are consistent with the data [21]. The answer to both of these issues is to randomize the training set using a technique called query-by-bagging [22].
Algorithm 1: Query-By-Committee
Input: Number of trials: N; number of committee members: n_c; a randomized classifier: A
Initialize: L_1, U_1 with random instances
for k = 1, ..., N do
    1. Train A on L_k and generate h_1, ..., h_{n_c}.
    2. Choose a point x* ∈ U_k that satisfies max_{y ∈ Y} |{ t ≤ n_c : h_t(x*) = y }| ≤ n_c / 2.
    3. Query x* to obtain y* = Oracle(x*).
    4. L_{k+1} = L_k ∪ {⟨x*, y*⟩}, U_{k+1} = U_k \ {x*}
end
Output: h(x) = arg max_{y ∈ Y} |{ t ≤ n_c : h_t(x) = y }|, where h_t are hypotheses of the Nth stage.
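To make the voting test in step 2 concrete, the following Python sketch checks whether a candidate point creates enough disagreement to be queried; it assumes a list of fitted scikit-learn-style classifiers (each exposing a predict method), and the helper names are illustrative, not part of the original algorithm.

    import numpy as np

    def max_vote_count(committee, x):
        # Votes received by the most popular label among committee members.
        votes = [h.predict(np.atleast_2d(x))[0] for h in committee]
        _, counts = np.unique(votes, return_counts=True)
        return counts.max()

    def disagrees(committee, x):
        # Step 2 of Algorithm 1: no label wins more than half the committee.
        return max_vote_count(committee, x) <= len(committee) / 2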
C. Query-By-Bagging

The query-by-bagging (QBag) algorithm, first introduced by Abe and Mamitsuka [22], can be thought of as a combination of QBC and bagging [23]. Like QBC, the method uses an ensemble of classifiers to make querying decisions. However, instead of training all committee members on the same data, QBag trains each classifier on a subset of the training set. Usually, these subsets are constructed by randomly sampling (with replacement) from the original training data. This allows deterministic classifiers, such as a SVM, to generate multiple predictions while still being trained with data that has a similar distribution to the initial training set. Like QBC, instances are queried if there is a significant amount of disagreement among the committee members.
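For concreteness, the bootstrap construction at the heart of QBag might be implemented as below. This is a sketch that assumes scikit-learn-style estimators; train_bagged_committee is an illustrative name, and the subsample size is taken equal to the size of the labeled pool, which is one common choice.

    import numpy as np
    from sklearn.base import clone

    def train_bagged_committee(base_clf, X, y, n_committee=20, seed=None):
        # Fit each committee member on a bootstrap resample (with replacement)
        # of the labeled pool, as in query-by-bagging.
        rng = np.random.default_rng(seed)
        committee = []
        for _ in range(n_committee):
            idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
            committee.append(clone(base_clf).fit(X[idx], y[idx]))
        return committee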
D. Comparison with Uncertainty Sampling
Uncertainty sampling is a popular approach for performing active learning [24]. It has been used with many different applications and has been shown to reduce the number of queried instances under certain conditions. In this work, we avoided uncertainty sampling; we would like to highlight some of the reasons why. Let's begin with a brief overview of the approach. The main idea behind uncertainty sampling is simple: query instances where the model is most uncertain about the label. There are numerous ways to do this. For example, an active learner may query the instance with the greatest Shannon entropy, which is given by

φ_ENT(x) = − Σ_y P_θ(y | x) log P_θ(y | x),

where P_θ(y | x) is the predicted probability, generated by the classifier, that an unlabeled instance x will have label y given a set of model parameters θ. The result for this and many other metrics is that most of the queried instances tend to lie on or near the decision boundaries. While sampling along decision boundaries can help the classifier make more accurate predictions, there are some additional considerations that shouldn't be overlooked.

To begin with, uncertainty sampling bases its queries on the predictions made by a single classifier, which is often trained using limited data. As a result, the classifier may make highly erroneous assumptions about the location of the decision boundaries. Most likely, this would cause the learner to query uninformative points, which in our case would result in a poor utilization of our computing resources. This problem is made worse when points are queried in batches, since each point in the batch is queried using the same inaccurate classifier. For our application, multiple simulations will be performed in parallel, so batch sampling is a must. An additional concern has to do with noisy data. While all learning algorithms tend to perform worse on noisy data, uncertainty sampling is especially vulnerable. This is due to the classifier relying on a small set of unreliable data, which can easily lead to inaccurate predictions.

With QBag, an ensemble of classifiers is used to make querying decisions. This means that a few inaccurate classifiers would not necessarily alter the querying results. Instead, the ensemble tends to have an averaging effect where highly erroneous predictions can be balanced out by the more sensible ones. An additional benefit is that each classifier in the ensemble is trained using a slightly different training set. This means that a mislabeled point has less influence on the querying process since some of the classifiers will be unaware of its existence. For additional details on how our active learning paradigm performs with noisy data, refer to Section III-C. Also, it is important to keep in mind that QBag is a version space method and was not developed with noisy data in mind. Most of the data we will be dealing with is deterministic, except for an occasionally mislabeled point due to numerical errors in the DEM-based simulations.
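For reference, the entropy criterion defined above can be computed in a few lines. The sketch below assumes the classifier exposes scikit-learn's predict_proba; the function name is illustrative.

    import numpy as np

    def shannon_entropy(probs):
        # phi_ENT(x) = -sum_y P(y|x) log P(y|x), evaluated row-wise for an
        # array of predicted class probabilities of shape (n_points, n_classes).
        p = np.clip(probs, 1e-12, 1.0)  # guard against log(0)
        return -(p * np.log(p)).sum(axis=1)

    # An uncertainty sampler would then query, e.g.,
    # x_query = X_pool[np.argmax(shannon_entropy(clf.predict_proba(X_pool)))]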
III. PROPOSED PARADIGM

This section will focus on some of the challenges that were encountered while developing our active learning paradigm. We will discuss several topics that are relevant to the application, such as classifier selection, dealing with noisy labels, and exploration vs. exploitation. This section will conclude with pseudocode for our active learning paradigm.

A. Model Selection
Selecting the right machine learning algorithm is an essential part of making accurate predictions. For this application, the main challenge is in generating a large enough training set, so the learning algorithm needs to be resourceful with the limited data it has available. Our scheme uses a classifier because many mobility maps use distinct colors to classify speed ranges, like the one shown in Fig. 1(b). For this type of mobility map, accurately resolving the decision boundaries is far more important than knowing the precise speed throughout the feature space. Therefore, a classifier that can query instances along the decision boundaries would likely require less data than a regression model that might spread itself thin by attempting to make accurate predictions throughout the entire feature space.

In [4], the authors tested the efficacy of various classifiers to generate mobility maps, including kNNs, SVMs, MLPs, kriging, decision trees, and random forests. They found that the MLP provided the most accurate predictions, followed by the SVM. In our own tests using these classifiers, we obtained similar results. Only two features—the longitudinal slope and the cone index—were considered in [4]. However, additional features are usually required in practice, and little is known about the target function in those cases. Therefore, a general-purpose classifier is needed. As a result, we ruled out SVM classifiers with linear, quadratic, or cubic kernels, since they cannot accurately model disjoint regions of the same class. Instead, we focused on the MLP and the SVM with a Gaussian kernel. These two classifiers were compared in Section IV-B1 using preliminary data.
B. Overfitting
Because little is known about the target function in advance, the classifier needs to be complex enough to accurately capture the underlying trends in the data. On the other hand, if the classifier is too complex, it may "memorize" individual instances and overfit the data. This is especially concerning since our training sets will be limited in size. As a result, we utilized several techniques to avoid overfitting while still ensuring that the model can capture more complex patterns in the data.

To begin with, the MLP classifier will use a rectified linear unit (ReLU) as the activation function. Unlike sigmoid and tanh activation functions, ReLU is sparsely activated [25]. This means that neurons can be turned off during the training process, which reduces the complexity of the model and helps to prevent overfitting.

Bagging will be used to reduce the importance placed on individual data instances. Bagging uses an ensemble of classifiers to make predictions about new data instances and works by training each classifier in the ensemble on a subset of the training set. Usually these subsets are constructed by randomly selecting instances from the training set with replacement. To make predictions about new instances, the ensemble takes a majority vote among its members. This averaging effect reduces the variance and helps to prevent overfitting.

Finally, both SVM and MLP classifiers will use a grid search to tune model parameters. While performing the grid search, the accuracy of the models will be measured using a k-fold cross-validation. This approach will allow us to approximate the accuracy of the model without having to use additional resources to construct a validation set. For the SVM, the penalty parameter C will be updated using a grid search each time a new batch of unlabeled instances is chosen. For the MLP, the number of neurons will be viewed as a parameter and will be determined using a grid search as well. By allowing the MLP to start with a few neurons and gradually add more as additional data becomes available, we are able to avoid issues with overfitting while still allowing the model to increase in complexity when needed. Refer to Section IV-B1 for more details; a sketch of this tuning step is given below.
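A minimal sketch of this tuning step, using scikit-learn's GridSearchCV, is shown here; the candidate grids are illustrative placeholders rather than the exact values used in our experiments.

    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    # Candidate values below are placeholders, not the grids used in the paper.
    mlp_search = GridSearchCV(
        MLPClassifier(activation="relu", max_iter=2000),
        param_grid={"hidden_layer_sizes": [(4,), (8,), (16,), (32,)]},
        cv=10,  # 10-fold cross-validation on the labeled pool
    )
    svm_search = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [0.1, 1.0, 10.0, 100.0]},
        cv=10,
    )
    # After each new batch: mlp_search.fit(X_labeled, y_labeled)
    # tuned_model = mlp_search.best_estimator_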
C. Noisy Labels

Noisy data, due to numerical errors or anomalies in the simulations, can have a detrimental effect on the accuracy of any learning algorithm. This is especially the case when using active learning, since the querying techniques are designed to seek out instances that will greatly affect the accuracy of the classifier. Based on data that was obtained in [4], we estimated that approximately 2% of data instances would be mislabeled. While agnostic active learning techniques that are designed for noisy data sets do exist (e.g., [26]), they tend to converge more slowly than aggressive techniques, such as QBag, when labels are mostly deterministic.

As previously mentioned, we used QBag to query instances and bagging to make predictions. Because each classifier in the ensemble only sees a subset of the training set, this reduces the impact a mislabeled instance may have on querying decisions and model predictions. In addition, when mislabeled instances do influence the querying process, they often lead the active learner to select unlabeled instances near the mislabeled point. If the incorrect label was the result of a random error and not a more fundamental issue with the oracle, then labeling instances near a mislabeled point can provide additional evidence for the right class.

To make sure our active learning paradigm can perform well when instances are occasionally mislabeled, we tested it using an oracle that incorrectly labeled instances 10% of the time, as sketched below. The results from these experiments are given in Section IV-B1.
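Such a noisy oracle can be mimicked with a simple wrapper like the one below; make_noisy_oracle is an illustrative helper, not part of the simulation pipeline.

    import numpy as np

    def make_noisy_oracle(oracle, classes, noise_rate=0.10, seed=None):
        # Answer each query with a wrong, uniformly chosen label with
        # probability noise_rate; otherwise return the true label.
        rng = np.random.default_rng(seed)
        def noisy(x):
            y = oracle(x)
            if rng.random() < noise_rate:
                wrong = [c for c in classes if c != y]
                y = wrong[rng.integers(len(wrong))]
            return y
        return noisy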
D. Batch-Mode Sampling
In order to generate enough data for the learning algorithm, we will need to run multiple simulations at a time. However, choosing instances that are close to each other or to a labeled point can give redundant information if the label happens to match the class of its neighbors. To avoid this, let us first define the region of disagreement. Let D, the region of disagreement, be the set of points in the feature space X where no more than half of the committee members agree on a label. That is,

D = { x ∈ X | max_{y ∈ Y} |{ t ≤ n_c : h_t(x) = y }| ≤ n_c / 2 },

where t is a positive integer, h_t is the t-th classifier, and n_c is the size of the committee. To avoid redundancies, we will choose instances in the region of disagreement that are both far from labeled instances and far from other instances that will be queried in the same batch. We do this by first choosing the unlabeled instance in the region of disagreement that is as far as possible from all labeled instances. Next, we choose the instance that is in the region of disagreement and that is also as far as possible from any labeled instance and the first queried instance. This process repeats until the number of sampled points equals the batch size. We summarize this process in Algorithm 2. To determine the batch size, we ran experiments on the test function shown in Fig. 2; refer to Section IV for more details.
Algorithm 2: MaxMinSample(n, U, L)
Input: Number of points to sample: n; a set of unlabeled instances to choose from: U; a set of labeled instances: L
Initialize Q_1 = ∅
for k = 1, ..., n do
    x*_k = arg max_{x_k ∈ U} ( min_{x ∈ L ∪ Q_k} ‖x − x_k‖ )
    Q_{k+1} = Q_k ∪ {x*_k}
end
Output: A set of well-spaced instances: Q_{n+1}
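A direct NumPy translation of Algorithm 2 might look as follows; max_min_sample is an illustrative name, and the sketch assumes the labeled set is nonempty.

    import numpy as np
    from scipy.spatial.distance import cdist

    def max_min_sample(n, U, L):
        # Greedily pick n points from the pool U, each as far as possible (in
        # Euclidean distance) from every labeled point and every point already
        # chosen in this batch.
        U = np.asarray(U, dtype=float)
        anchors = np.asarray(L, dtype=float)
        available = np.ones(len(U), dtype=bool)
        chosen = []
        for _ in range(min(n, len(U))):
            d = cdist(U, anchors).min(axis=1)  # distance to nearest anchor
            d[~available] = -np.inf            # never re-pick a point
            k = int(np.argmax(d))
            chosen.append(U[k])
            available[k] = False
            anchors = np.vstack([anchors, U[k:k + 1]])
        return np.array(chosen)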
E. Exploration vs. Exploitation

Sampling bias is an issue with many active learning strategies. By allowing the learning algorithm to query new instances, there's often a good chance that the final training set will not accurately represent the true underlying distribution of the target function. As a result, this can cause the active learner to converge to a suboptimal hypothesis. In practice, this often means that the learning algorithm never discovers that it incorrectly classified a portion of the feature space, because the sampling scheme only queries points outside of the misclassified region. This tends to happen when the active learner queries too many points near the known decision boundaries and ignores more distant points that might expose the learner's oversight.

One way to avoid this is by occasionally querying points outside the region of disagreement. However, this can be tricky since querying too many points will slow down the convergence of our scheme, but not querying enough points may cause the learner to misclassify large portions of the feature space. This conundrum is known as the exploration vs. exploitation dilemma. In this case, exploration is referring to the process of querying over a large portion of the feature space so that all misclassified regions are discovered in a reasonable time. Exploitation, on the other hand, is the process of querying over small regions with promising results, such as the region of disagreement, in order to refine the ensemble's predictions.

To find the right balance, we performed experiments on the test function discussed in Section IV-A1. We found that if we used 1/8th of the queried points for exploration and the remaining points for exploitation, then the convergence was only slightly slower than what we observed when we didn't use any exploratory points at all. Also, this proportion was still large enough so that the scheme could quickly and consistently discover all 5 classes.

When choosing exploratory points, we used the same approach that we used when doing batch sampling. The only difference is that we now choose points outside of the region of disagreement. We used this approach since we wanted to avoid querying exploratory points near labeled points, since doing so would likely provide us with redundant information.
Algorithm 3: Active Learning Paradigm
Input: Number of trials: N; batch size: n_b; number of exploratory points: n_e; number of committee members: n_c; a classifier: A
Initialize: L_1, U_1 with random instances
for k = 1, ..., N do
    1. Use cross-validation to find parameters for A.
    2. Randomly sample from L_k with replacement to obtain subsamples L'_1, ..., L'_{n_c}, each of size m_c.
    3. Train A on each subsample to obtain h_1, ..., h_{n_c}.
    4. Find D, which consists of the points x ∈ U_k that satisfy max_{y ∈ Y} |{ t ≤ n_c : h_t(x) = y }| ≤ n_c / 2.
    5. Q_k = MaxMinSample(n_b − n_e, D, L_k)
       E_k = MaxMinSample(n_e, U_k \ D, L_k)
       X* = Q_k ∪ E_k, y* = Oracle(X*)
       L_{k+1} = L_k ∪ ⟨X*, y*⟩, U_{k+1} = U_k \ X*
end
Output: h(x) = arg max_{y ∈ Y} |{ t ≤ n_c : h_t(x) = y }|, where h_t are hypotheses of the Nth stage.
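Putting the pieces together, one iteration of Algorithm 3 might be sketched as below. The sketch reuses the illustrative helpers train_bagged_committee and max_min_sample from earlier sections; it omits the cross-validation step (step 1) and assumes both the disagreement region and its complement contain enough points to fill their portions of the batch.

    import numpy as np

    def active_learning_round(base_clf, X_lab, y_lab, X_pool, oracle,
                              n_batch=32, n_explore=4, n_committee=20):
        # One pass of steps 2-5: train a bagged committee, find the region of
        # disagreement, and assemble a well-spaced batch of queries.
        committee = train_bagged_committee(base_clf, X_lab, y_lab, n_committee)
        votes = np.stack([h.predict(X_pool) for h in committee])  # (n_c, |U|)
        max_votes = np.array([np.unique(col, return_counts=True)[1].max()
                              for col in votes.T])
        D = max_votes <= n_committee / 2  # region-of-disagreement mask
        X_exploit = max_min_sample(n_batch - n_explore, X_pool[D], X_lab)
        X_explore = max_min_sample(n_explore, X_pool[~D], X_lab)
        X_new = np.vstack([X_exploit, X_explore])
        y_new = np.array([oracle(x) for x in X_new])
        # The caller appends (X_new, y_new) to the labeled pool, removes X_new
        # from the unlabeled pool, retrains, and repeats.
        return X_new, y_new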
IV. EXPERIMENTS

This section will focus on the experiments that were performed using our active learning scheme. We will begin by describing the setup for these experiments along with additional details on how the active learning paradigm was configured. We will then discuss the results from some of the preliminary tests we performed. We used these preliminary tests to gauge the model's performance before running simulations. After observing satisfactory results from the preliminary tests, we then used physics-based simulations to label instances in both 2- and 3-dimensional feature spaces. For the 2-dimensional problem, the adhesion factor and friction coefficient were used as the two features. These features differ from the ones that were considered in [4] and will serve as a credible challenge to our active learning scheme, since the target function is unknown in advance. Finally, we will include a third feature, the soil density, and will compare our results with random sampling.
A. Experimental Setup

1) Preliminary Tests:
We initially had access to the 528 data points that were generated in [4]. The points and their labels are shown in Fig. 2(a). The labels indicate a range for the speed-made-good based on the longitudinal slope and the cone index. Before running simulations, we constructed a test function in order to gauge the performance of our active learning scheme. The test function was constructed by training SVM with a cubic kernel on the 528-point data set and is shown in Fig. 2(b). We felt that this test function would serve as a good predictor of the model's performance, since the original data set was generated using the same DEM-based model.
Fig. 2. (a) This figure shows the original data set that was obtained in [4]. There is a total of 528 points, and each point was labeled by running a DEM simulation. (b) This figure shows the target function that was used in the preliminary tests. The target function was constructed by training SVM with a cubic kernel on the data set in (a).
For the preliminary tests, we used a MLP and a SVM with a Gaussian kernel to predict the speed-made-good. The test function in Fig. 2(b) was used as the ground truth. For the MLP classifier, we used a single hidden layer and determined the number of nodes by performing a 10-fold cross-validation on the labeled pool. We started with just 4 nodes in the hidden layer, and with each new batch, we either halved, doubled, or left the number of nodes unchanged, depending on which option gave the lowest cross-validation error. As a safeguard, we never let the number of nodes drop below 2, even though this did not appear to be a significant issue, since the number of nodes tended to gradually increase as more labeled data became available. This approach was adopted because, unlike supervised learning where the training set is known in advance, little may be known about the target function and its complexity, so choosing a reasonable number of nodes without eventually overfitting or underfitting the available data can sometimes be challenging. For the 2D test case, we did not see much difference in the prediction accuracy when we used fewer nodes, such as 10, or a larger number, such as 100. However, in a 3D preliminary test case where we extruded the 2D test function, overfitting became more of an issue, and an adaptive approach was needed. A sketch of this adaptive rule is given at the end of this subsection.

For the SVM, we set the kernel coefficient γ = 1/ and used a grid search combined with a 10-fold cross-validation on the training data to determine the penalty parameter C. The penalty parameter was updated each time a new batch of labeled instances was added to the labeled pool L.

For both classifiers, we set the batch size n_b = 32, the number of exploratory points n_e = 4, and the number of committee members n_c = 20. While smaller committee sizes would likely yield similar results, we were not as concerned with the computational cost of the active learner, since its cost is negligible in comparison to the simulations. In addition, a large committee can help to average out some of the more extreme predictions made by a minority of its members. This can help to avoid excessive fluctuations in the predictions made by the committee as more training data becomes available.

To understand how our scheme would perform with occasionally mislabeled data, we ran additional tests where each queried instance was mislabeled 10% of the time. We chose this number because we felt it was a conservative overestimate of the noise that may result due to numerical errors in the simulations. As a commonly used yardstick, we compared all of our results to random sampling, where we used the same ensemble of 20 classifiers to make predictions.
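A sketch of the halve/double/keep rule described above, assuming scikit-learn, is given here; update_width is an illustrative name.

    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    def update_width(n_nodes, X_lab, y_lab, n_min=2):
        # After each batch, halve, double, or keep the hidden-layer width,
        # whichever scores best under 10-fold cross-validation.
        candidates = sorted({max(n_min, n_nodes // 2), n_nodes, 2 * n_nodes})
        def cv_score(n):
            clf = MLPClassifier(hidden_layer_sizes=(n,), activation="relu",
                                max_iter=2000)
            return cross_val_score(clf, X_lab, y_lab, cv=10).mean()
        return max(candidates, key=cv_score)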
2) 2D Problem:
To test our scheme's performance on a "real-world" problem, we ran full-length DEM simulations to predict the speed-made-good for a previously unknown target function. This target function depends on two dimensionless variables: the friction coefficient and the adhesion factor. Since we could only perform this test once, due to computational constraints, we used a MLP as our classifier, since it tended to make more accurate predictions with fewer training instances when compared to the SVM. Refer to Section IV-A1 for more details.

As mentioned previously, additional tests performed on the target function in Fig. 2(b) did not show any significant difference in the accuracy of the predictions when we varied the number of neurons in the hidden layer from 10 to 100. Therefore, we used 100 neurons for the 2D problem in case the target function happened to be fairly complex. It wasn't until later, after the simulations for the 2D problem completed, that we observed that overfitting could pose a more significant challenge in 3D. As a result, the adaptive approach that was mentioned earlier was only utilized for the 3D case and the preliminary tests. However, based on our earlier tests, we do not expect that using 100 neurons in the hidden layer would have any significant effect on the accuracy of our predictions for the 2D case.

Other than the number of neurons in the hidden layer, all hyperparameters will be the same as the ones used in the preliminary tests. This includes the batch size, which was set to 32. We chose this number because it appeared to be a relatively large batch size that consistently gave rapid convergence to the target function in Fig. 2. In an ideal setting, a smaller batch size would be preferred, but due to the computational demands involved in labeling points, running simulations in parallel was the only practical option.
3) 3D Problem:
For the 3D case, we added an additional parameter, the soil density, and again used DEM simulations to label queried points. The setup for this test was nearly identical to the 2D case, except for a few things. The first was that we allowed the number of neurons in the hidden layer to vary, depending on the cross-validation errors. Second, we ran many more simulations. To be precise, we ran a total of 1,384 full-length DEM simulations. In all, there were 468 simulations that were used to make up the testing set, 448 simulations that were chosen using our sampling scheme, and 448 simulations that were used to construct a randomly generated training set. In addition to that, 20 more simulations were used to construct the initial data set that the QBag and random sampling schemes both started from. Of course, we would expect that more data would be needed due to the curse of dimensionality. However, there may be a slight benefit that comes with this extra dimension. This leads to the third difference, which is the size of the batches. For the 2D case, we used batches of 32. However, for some problems with 3 features, we noticed that quick convergence could often be observed with batch sizes larger than 32. We suspect that this is the case because the higher dimensionality of the decision boundaries makes it possible for more points to be queried simultaneously without too much redundancy. Because of this and additional tests we performed using basic 3D test functions, we used batches of 64 for the 3D case.
B. Results

1) Preliminary Tests:
Figs. 3(a) and 3(b) show a comparison between random sampling and our active learning approach. The predictions for both figures were generated by an ensemble of MLPs. Notice how the active learning-based approach focuses heavily on the decision boundaries while occasionally sampling points throughout the feature space. In Fig. 3(a), we can see that the same ensemble had difficulties reproducing the test function when we used a training set consisting of randomly queried points. Notice how the ensemble often guesses where the decision boundary is by placing the line approximately halfway between neighboring points with distinct labels. While this may seem like a reasonable thing to do, and it probably is, the lack of data near the decision boundaries results in a classifier that misses many of the details that are captured by our active learning-based approach. As a result, our sampling scheme achieved an accuracy of 99.0% with only 212 labeled data points, while random sampling achieved an accuracy of 95.7%. While this may not seem like a huge difference, obtaining that extra 3.3% can be extremely difficult when the accuracy is so high to begin with.

Fig. 3. These two figures show a comparison between random sampling (a) and our active learning scheme (b). In both cases, the target function in Fig. 2(b) was used as the ground truth. For this comparison, all queried points were correctly labeled.

Fig. 4(a) shows how the number of queried instances impacts the error rates for the ensemble of classifiers. To generate these plots, each ensemble was trained 30 times starting with distinct, randomly generated initial training sets. Each line indicates the median accuracy that was obtained by the corresponding ensemble, and the error bars show the 50% confidence interval. Notice how both MLP and SVM classifiers attained higher accuracies with lower variation when they were trained using our active learning scheme. After being trained on 212 points, the MLP reached a median accuracy of 98.9% when trained with our scheme, and the SVM performed slightly better with a final accuracy of 99.1%. With random sampling, the MLP maxed out with a median accuracy of 95.4% and the SVM attained a median accuracy of 95.3%. Now suppose that the desired accuracy is set at 95%. Using random sampling, this is first achieved by the MLP after running 180 simulations and by the SVM after running 212 simulations. With our active learning-based approach, 95% accuracy is first surpassed using 116 simulations.

Fig. 4. These two figures compare the median accuracy of the tested schemes as the number of labeled instances is increased. (a) Shows the accuracy when all queried points are correctly labeled. (b) Shows the accuracy when 10% of the data is mislabeled. The error bars in both figures show the 50% confidence interval when the learning algorithms were trained 30 times with distinct, randomly generated initial data. Notice that the improvement in accuracy for the MLP is reduced when some of the training set is mislabeled. Mislabeled data can erode away at some of the gains in accuracy that are achieved through active learning. Therefore, it's imperative that the number of mislabeled instances be kept to a minimum.

In Fig. 4(b), we again compared random sampling with our active learning scheme. However, this time, we trained the classifiers with noisy data. As before, we used the same ensemble of 20 MLPs. The difference is that we intentionally provided the learner with incorrect labels 10% of the time. This was to simulate noisy labels due to numerical inaccuracies in the DEM-based simulations. We will see later that this in fact became an issue once we started labeling instances using simulations. However, despite being trained with partially incorrect data, our active learning scheme still outperformed random sampling, though in general we wouldn't necessarily expect the difference to be as large as it was in the noise-free case. Part of the reason we're seeing such a considerable improvement in performance, despite the noisy data, may be due to the ensemble creating a small region of disagreement around the mislabeled instances. This could coax the learning algorithm into querying additional points nearby.
If the mislabeled point was simply the result of a random error, then querying more points nearby could provide additional evidence for the correct label.

Based on the results from Figs. 4(a) and 4(b), we decided to mainly focus on the MLP classifier in the remaining tests. The reason for this is that the MLP typically produced more accurate results sooner than the SVM. This is an important feature when generating mobility maps, since in most cases it will only be possible to run a limited number of simulations.
2) 2D Problem:
In Figs. 5(a) and 5(b), we compare the predictions made by the MLP ensembles after querying points randomly and with our active learning scheme, respectively. In each case, we trained on the same initial set of 20 randomly sampled points. After that, an additional 192 instances were queried for each approach. To measure the accuracy of the predictions, we constructed a separate testing set that consisted of 212 randomly generated points. We found that the active learning scheme did significantly better than random sampling. To be precise, the active learning scheme achieved 98.1% accuracy on the testing set, while random sampling only reached 93.4%.

Fig. 5. These two figures compare the predictions that resulted from using random sampling (a) and our active learning paradigm (b). Each point was labeled by running a DEM simulation. Because simulations could be performed in parallel, we queried points in batches of 32. In each batch, 28 points were selected using active learning and the remaining 4 were chosen outside the region of disagreement using Algorithm 2. Each figure was generated by training an ensemble of MLPs on 212 labeled instances.

The prediction accuracy vs. the number of simulations is shown in Fig. 6. Notice that after being trained with 116 points, our scheme surpassed the highest accuracies obtained by random sampling. Therefore, to reach 95% we would need to sample at least 65 more points with the random scheme than we would with the active learning approach.

Fig. 6. This figure compares the convergence behavior of random sampling with our active learning paradigm. Notice that our active learning paradigm obtains a higher accuracy with only 116 points than random sampling with 212 points.

Finally, if we refer back to Fig. 5, there's one more thing to discuss. In the region where the friction coefficient and adhesion factor are both small, there appears to be a data point that was mislabeled. (A second point nearby may have been mislabeled as well.) As mentioned earlier, the potential for noisy data was a concern from the beginning. In this case, it appears that the mislabeled point may be the result of the time step being too large in the DEM simulations, since the simulations tend to lose stability when the friction coefficient and adhesion factor are both small. This could potentially be addressed by applying a voting filter to the data in order to identify points with questionable labels [27]. The speed-made-good could then be reevaluated using a smaller time step.
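A simple committee-based voting filter, in the spirit of [27], might be sketched as follows; the function name and agreement threshold are illustrative.

    import numpy as np

    def flag_questionable_labels(committee, X_lab, y_lab, min_agreement=0.5):
        # Flag labeled points whose stored label disagrees with most committee
        # members; the flagged simulations could be re-run with a smaller
        # time step.
        votes = np.stack([h.predict(X_lab) for h in committee])  # (n_c, n_points)
        agreement = (votes == np.asarray(y_lab)).mean(axis=0)
        return np.flatnonzero(agreement < min_agreement)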
3) 3D Problem:
In Figs. 7(a) and 7(b), we plotted the labeled instances that were selected using random sampling and our active learning scheme, respectively. Notice that most of the points selected by our active learning scheme tend to cluster around the decision boundaries. In addition, there are a few instances farther away that serve as exploratory points. For this test, each batch of 64 points had 56 points that were queried within the region of disagreement, and 8 points were used for exploratory purposes.

Fig. 7. (a) A scatter plot of the points that were queried using random sampling. The colors indicate the label that was obtained by running a DEM simulation for each point. (b) A scatter plot of the points that were queried using our active learning scheme. Notice that the scheme tends to query points near the decision boundaries.

Fig. 8 shows a horizontal slice of the prediction that was generated using our active learning scheme. The figure shows how the friction coefficient and the adhesion factor affect the speed-made-good when the density of the soil is fixed at 1,800 kg/m³. It turns out that in the 2D case, we also fixed the density at 1,800 kg/m³, so this slice should correspond to the prediction shown in Fig. 5(b). On first observation, it appears that the main difference is along the decision boundary between the red and orange classes. As mentioned before, this is likely due to inaccurate labels that were the result of numerical instabilities in the DEM simulations. In future tests, it may be beneficial to either reduce the time step when points are queried in that region or to rerun a simulation using a smaller time step when a voting filter considers the label to be questionable.

Fig. 8. This figure shows the prediction that was generated by the 3D model after it was trained using our active learning scheme. In this figure, the density of the soil was fixed at 1,800 kg/m³, which corresponds to the density that was used in the 2D case.

Finally, in Fig. 9 we compared the convergence behavior of random sampling with our active learning scheme. We found that our scheme provided a significant benefit over random sampling. As an example, if we used random sampling to train the ensemble to 95% accuracy, we would need to run 404 simulations to generate the data. On the other hand, if we used our active learning scheme, we would only need 148 labeled points in order to exceed that same accuracy. That's nearly a reduction in the number of simulations by a factor of 3.

Fig. 9. This figure compares the convergence behavior of random sampling with our active learning paradigm. Notice that our active learning paradigm exceeds 95% accuracy with only 148 points. This contrasts with random sampling, which needs 404 points to reach 95% accuracy.
V. CONCLUSIONS AND FUTURE WORK
We have demonstrated that query-by-bagging can be used to significantly reduce the number of physics-based simulations that are needed to construct a mobility map when compared to random sampling. In addition, we have expanded the feature space and are able to accurately predict the speed-made-good using the friction coefficient, the adhesion factor, and the density. Finally, we have provided a framework for generating mobility maps that can be used to incorporate additional soil parameters.

There are a number of interesting directions for future work. To begin with, more work needs to be done to determine which parameters are most important for predicting the speed-made-good. That is, feature reduction or feature extraction techniques need to be utilized to reduce the dimension of the feature space. This could be done using a technique such as principal component analysis or by training an autoencoder.

Another direction could focus on expanding the feature space to include multiple vehicle designs or multiple soil types (mud, snow, sand, etc.). This could be done by using a one-hot encoding. By using a single classifier with multiple vehicle designs or soil types, it may be possible to reduce the overall number of simulations. This is because some of the information that's learned for a single vehicle design or soil type could prove to be useful when training the learner about a new vehicle or soil type.

Finally, it may be possible to use transient data to predict the speed-made-good. When running simulations, we made sure that the vehicle's speed came to a steady state before labeling the point. However, there were many cases where the final label was obvious long before the vehicle reached a steady state. Therefore, it may be reasonable to train a classifier to predict the label for the speed-made-good long before the simulation reaches a steady state. This could help to significantly reduce the runtime for some of the simulations.

VI. ACKNOWLEDGMENTS
We thank Dave Mechergui for several discussions pertaining to this work and Tamer Wasfy for help with DEM simulations. We acknowledge support from the Automotive Research Center (ARC) in accordance with Cooperative Agreement W56HZV-19-2-0001 with U.S. Army CCDC Ground Vehicle Systems Center. This research was supported in part through computational resources and services provided by Advanced Research Computing at the University of Michigan, Ann Arbor.
REFERENCES

[1] P. W. Haley, M. Jurkat, and P. M. Brady Jr., NATO Reference Mobility Model, Edition 1, Users Guide. Technical Report Number 12503, U.S. Army Tank-Automotive and Armaments Command, Warren, MI, 1979.
[2] R. B. Ahlvin and P. W. Haley, NATO Reference Mobility Model: Edition II, NRMM User's Guide. U.S. Army Engineer Waterways Experiment Station, 1992.
[3] J. Dasch, P. Jayakumar, P. Bradbury, R. Gonzalez, H. Hodges, A. Jain, K. Iagnemma, M. Letherwood, M. McCullough, J. Priddy, and B. Wojtysiak, "ET-148 Next-Generation NATO Reference Mobility Model (NG-NRMM)," Science and Technology Organization Technical Report ET-148, North Atlantic Treaty Organization, Brussels, 2016.
[4] P. Jayakumar and D. Mechergui, "Efficient generation of accurate mobility maps using machine learning algorithms," (to appear in) Journal of Terramechanics, 2020.
[5] T. M. Wasfy, D. Mechergui, and P. Jayakumar, "Understanding the effects of a discrete element soil model's parameters on ground vehicle mobility," Journal of Computational and Nonlinear Dynamics, vol. 14, no. 7, 2019.
[6] "The Flux HPC cluster." https://arc-ts.umich.edu/flux/.
[7] D. Negrut, A. Tasora, M. Anitescu, H. Mazhar, T. Heyn, and A. Pazouki, "Solving large multibody dynamics problems on the GPU," in GPU Computing Gems Jade Edition, pp. 269–280, Elsevier, 2012.
[8] E. Corona, D. Gorsich, P. Jayakumar, and S. Veerapaneni, "Tensor train accelerated solvers for nonsmooth rigid body dynamics," Applied Mechanics Reviews, vol. 71, no. 5, 2019.
[9] H. Yamashita, G. Chen, Y. Ruan, P. Jayakumar, and H. Sugiyama, "Hierarchical multiscale modeling of tire–soil interaction for off-road mobility simulation," Journal of Computational and Nonlinear Dynamics, vol. 14, no. 6, 2019.
[10] S. De, E. Corona, P. Jayakumar, and S. Veerapaneni, "Scalable solvers for cone complementarity problems in frictional multibody dynamics," pp. 1–7, IEEE, 2019.
[11] B. Settles, "Active learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, pp. 1–114, 2012.
[12] C. A. Thompson, M. E. Califf, and R. J. Mooney, "Active learning for natural language parsing and information extraction," ICML, pp. 406–414, 1999.
[13] D. Reker, "Active-learning strategies in computer-assisted drug discovery," Drug Discovery Today, pp. 458–465, 2015.
[14] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, pp. 45–66, 2001.
[15] S. Tong and E. Chang, "Support vector machine active learning for image retrieval," Proceedings of the Ninth ACM International Conference on Multimedia, 2001.
[16] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu, "Batch mode active learning and its application to medical image classification," Proceedings of the 23rd International Conference on Machine Learning, 2006.
[17] A. Stumpf, N. Lachiche, J.-P. Malet, N. Kerle, and A. Puissant, "Active learning in the spatial domain for remote sensing image classification," IEEE Transactions on Geoscience and Remote Sensing, pp. 2492–2507, 2014.
[18] T. M. Mitchell, "Generalization as search," Artificial Intelligence, vol. 18, no. 2, 1982.
[19] H. S. Seung, M. Opper, and H. Sompolinsky, "Query by committee," Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 287–294, 1992.
[20] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, "Selective sampling using the query by committee algorithm," Machine Learning, 1997.
[21] C. Campbell, N. Cristianini, and A. Smola, "Query learning with large margin classifiers," Proceedings of the 17th International Conference on Machine Learning, 2000.
[22] N. Abe and H. Mamitsuka, "Query learning strategies using boosting and bagging," Machine Learning: Proceedings of the Fifteenth International Conference, 1998.
[23] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[24] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," Proceedings of the 17th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, 1994.
[25] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
[26] M.-F. Balcan, A. Beygelzimer, and J. Langford, "Agnostic active learning," Journal of Computer and System Sciences, vol. 75, no. 1, pp. 78–89, 2009.
[27] S. Verbaeten and A. Van Assche, "Ensemble methods for noise elimination in classification problems," Multiple Classifier Systems, 2003.