Towards Interpretable-AI Policy Induction using Evolutionary Nonlinear Decision Trees for Discrete Action Systems
Yashesh Dhebar, Kalyanmoy Deb, Subramanya Nageshrao, Ling Zhu, and Dimitar Filev
Michigan State University, East Lansing, MI 48824, USA
Ford Motor Company, Detroit, USA
{dhebarya,kdeb}@msu.edu, {snageshr,lzhu40,dfilev}@ford.com
COIN Lab Report: 2020018

Abstract
Black-box artificial intelligence (AI) induction methods such as deep reinforcement learning (DRL) are increasingly being used to find optimal policies for a given control task. Although policies represented using a black-box AI are capable of efficiently executing the underlying control task and achieving optimal closed-loop performance (controlling the agent from the initial time step until the successful termination of an episode), the developed control rules are often complex and neither interpretable nor explainable. In this paper, we use a recently proposed nonlinear decision-tree (NLDT) approach to find a hierarchical set of control rules in an attempt to maximize the open-loop performance for approximating and explaining the pre-trained black-box DRL (oracle) agent using the labelled state-action dataset. Recent advances in nonlinear optimization approaches using evolutionary computation facilitate finding a hierarchical set of nonlinear control rules as a function of state variables, using a computationally fast bilevel optimization procedure at each node of the proposed NLDT. Additionally, we propose a re-optimization procedure for enhancing the closed-loop performance of an already derived NLDT. We evaluate our proposed methodologies (open- and closed-loop NLDTs) on four different control problems having two to four discrete actions. In all these problems, our proposed approach is able to find simple and interpretable rules involving one to four nonlinear terms per rule, while simultaneously achieving on-par closed-loop performance compared to a trained black-box DRL agent. The obtained results are inspiring, as they suggest the replacement of complicated black-box DRL policies involving thousands of parameters (making them non-interpretable) with simple interpretable policies. The results are encouraging and motivate further applications of the proposed approach to more complex control tasks.
1 Introduction

Control system problems are increasingly being solved using modern reinforcement learning (RL) and other machine learning (ML) methods to find an autonomous agent (or controller) that provides an optimal action A_t for every state-variable combination S_t in a given environment at every time step t. Execution of the output action A_t takes the object to the next state S_{t+1} in the environment, and the process is repeated until a termination criterion is met. The mapping between the input state S_t and the output action A_t is usually captured through an artificial intelligence (AI) method. In the RL literature, this mapping is referred to as a policy (π(S): S → A), where S is the state space and A is the action space. Sufficient literature exists on the efficient training of these RL policies (Schulman et al. 2015, 2017; Lillicrap et al. 2015; Mnih et al. 2016). While these methods are efficient at training AI policies for a given control task, the developed AI policies, captured through complicated networks, are complex and non-interpretable.

Interpretability of AI policies is important to a human mind for several reasons: (i) it provides better insight and knowledge into the working principles of the derived policies, (ii) interpretable policies can be easily deployed on low-fidelity hardware, and (iii) they may allow an easier way to extend the control policies to more complex versions of the problem. While defining interpretability is a subjective matter, a number of past efforts have attempted to find interpretable AI policies with limited success.

In the remainder of this paper, we first present the main motivation behind finding interpretable policies in Section 2. A few past studies on arriving at interpretable AI policies are presented in Section 3. In Section 4, we review a recently proposed nonlinear decision tree (NLDT) approach in the context of arriving at interpretable AI policies. The overall open-loop and closed-loop NLDT policy generation methods are described in Section 5. Results on four control system problems are presented in Section 6. Finally, conclusions and future studies are presented in Section 7. The Supplementary document provides further details.

2 Motivation for the Study

Various data analysis tasks, such as classification, controller design, regression, image processing, etc., are increasingly being solved using artificial intelligence (AI) methods. This is done not because these methods are new and interesting, but because they have been demonstrated to solve complex data analysis tasks without much change to their usual frameworks. With more such studies over the past few decades, they now face a huge challenge: achieving a high-accuracy solution does not necessarily satisfy a curious domain expert, particularly if the solution is not interpretable or explainable. A technique (whether AI-based or otherwise) that handles data well is no longer enough; researchers now demand an explanation of why and how it works.

Consider the MountainCar control system problem, which has been extensively studied using various AI methods (Sutton 1996; Peters, Mülling, and Altun 2010; Smart and Kaelbling 2000). The problem has two state variables (position x_t along the x-axis and velocity v_t along the positive x-axis) at every time instant t, which describe the state of the car at t.
Based on the state vector S_t = (x_t, v_t), a policy π(S) must decide on one of three actions A_t: decelerate (A_t = 0) along the positive x-axis with a pre-defined value −a, do nothing (A_t = 1), or accelerate (A_t = 2) with a in the positive x-axis direction. The goal of the control policy π(S) is to take the under-powered car (it does not have enough fuel to directly climb the mountain and reach the destination) over the right hump in a maximum of 200 time steps, starting anywhere at the trough of the landscape. Physical laws of motion are applied, and a policy π(S) has been trained to solve the problem. The RL produces a black-box policy π_oracle(S) for which an action A_t ∈ {0, 1, 2} is produced for a given input S_t = (x_t, v_t) ∈ R².

[Figure 1: State-action combinations for the MountainCar problem: (a) using π_oracle; (b) using NLDT.]

Figure 1a shows the state-action combinations obtained from 92 independent successful trajectories (amounting to a total of 10,000 time steps) that achieve the goal using a pre-trained deterministic black-box policy π_oracle. The x-location of the car and its velocity can be read off from a point on the 2D plot. The color of the point S_t = (x_t, v_t) indicates the action A_t suggested by the oracle policy π_oracle (A_t = 0: blue, A_t = 1: orange, and A_t = 2: green). If a user is now interested in understanding how the policy π_oracle chooses a correct A_t for a given S_t, one way to achieve this is through an interpretable policy function π_int(S_t) as follows:

  π_int(S_t) = 0, if φ_1(S_t) is true; 1, if φ_2(S_t) is true; 2, if φ_3(S_t) is true,   (1)

where φ_i(S_t): R² → {0, 1} is a Boolean function which partitions the state space S into two sub-domains based on its output value, and for a given state S_t exactly one of the φ_i(S_t) is true, thereby making the policy π_int deterministic. If we re-look at Figure 1a, we notice that the three actions are quite mixed at the bottom part of the x–v plot (state space). Thus, the partitioning Boolean functions φ_i need to be quite complex in order to have φ_1(S_t) = true for all blue points, φ_2(S_t) = true for all orange points, and φ_3(S_t) = true for all green points.

What we address in this study is an attempt to find an approximated policy function π_int(S_t) which may not explain 100% of the time-instance data corresponding to the oracle black-box policy π_oracle(S_t) (Figure 1a), but which is fairly interpretable and explains close to 100% of the data. Consider the state-action plot in Figure 1b, which is generated with a simple and interpretable policy π_int(S_t) (Eq. 1 with the φ_i given below) obtained by our proposed procedure:

  φ_1(S_t) = ¬ψ_1(S_t),
  φ_2(S_t) = ψ_1(S_t) ∧ ¬ψ_2(S_t),   (2)
  φ_3(S_t) = ψ_1(S_t) ∧ ψ_2(S_t),

where ψ_1(S_t): |0.… − 0.…/x̂_t + 0.…/v̂_t − 0.… x̂_t v̂_t| ≤ 0.…, and ψ_2(S_t): |1.… − 0.… x̂_t − 0.… v̂_t| ≤ 0.…. Here, x̂_t and v̂_t are normalized state variables (see the Supplementary document for details). The action A_t predicted using the above policy does not match the output of π_oracle at some states (about 8.1%), but from our experiments we observe that it is still able to drive the mountain car to the destination goal located on the right hill in 99.8% of episodes. Importantly, the policies are simplistic and amenable to an easier understanding of the relationships between x_t and v_t needed to achieve near-perfect control.
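For illustration, the following minimal Python sketch shows how a policy of the form of Eqs. 1 and 2 can be executed in a closed loop on the MountainCar environment, using the classic OpenAI Gym API. The coefficients C1 through C9 are hypothetical placeholders, since the evolved values are specific to each run; the normalization bounds are the ones reported in the Supplementary document.

```python
# Minimal sketch: closed-loop rollout of an interpretable two-rule policy of
# the form of Eqs. (1)-(2) on MountainCar-v0 (classic Gym API assumed).
import gym
import numpy as np

# Normalization bounds from the Supplementary document (Figure A.6 caption);
# Eq. (8) maps each state variable into [1, 2].
X_MIN = np.array([-1.20, -0.06])
X_MAX = np.array([0.50, 0.06])

# Hypothetical coefficients C1..C9; the actual evolved values are run-specific.
C1, C2, C3, C4, C5 = 0.9, 0.5, 0.3, 0.4, 0.2
C6, C7, C8, C9 = 1.1, 0.6, 0.3, 0.3

def normalize(s):
    return 1.0 + (s - X_MIN) / (X_MAX - X_MIN)

def pi_int(state):
    xh, vh = normalize(np.asarray(state, dtype=float))
    psi1 = abs(C1 - C2 / xh + C3 / vh - C4 * xh * vh) <= C5
    psi2 = abs(C6 - C7 * xh - C8 * vh) <= C9
    if not psi1:
        return 0                  # phi_1 true -> decelerate
    return 2 if psi2 else 1       # phi_3 -> accelerate, phi_2 -> do nothing

env = gym.make("MountainCar-v0")
state, done = env.reset(), False
while not done:
    state, reward, done, info = env.step(pi_int(state))
```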
Since the explanation process uses the data from π_oracle as the universal truth, the derived relationships also provide an explanation of the working of the black-box policy π_oracle. A grosser approximation of Figure 1a by more simplified relationships (φ_i) may reduce the overall open-loop accuracy of matching the output of π_oracle. Hence, a balance between good interpretability and high open-loop accuracy in the search for the Boolean functions φ_i(S_t) becomes an important matter for such an interpretable AI-policy development study.

In this paper, we focus on developing a search procedure for arriving at the ψ-functions (see Eq. 2) for discrete-action systems. The structure of the policy π_int(S_t) shown in Eq. 1 resembles a decision tree (DT), but unlike a standard DT, it involves a nonlinear function at every non-leaf node, requiring an efficient nonlinear optimization method to arrive at reasonably succinct and accurate functionals. The procedure we propose here is generic and is independent of the AI method used to develop the black-box policy π_oracle.

3 Related Past Studies

In (Noothigattu et al. 2018), an interpretable orchestrator is developed to choose between two RL policies: π_C for maximizing reward and π_R for maximizing an ethical consideration. The orchestrator depends on only one of the state variables, and despite it being interpretable, the policies π_C and π_R are still black-box and convoluted. (Maes et al. 2012) constructs a set of interpretable index-based policies and uses a multi-arm bandit procedure to select a high-performing index-based policy. The search space of interpretable policies there is much smaller, and the procedure suggested for finding an interpretable policy is computationally heavy, taking hours to several days of computational time on simple control problems. In (Hein, Udluft, and Runkler 2018), genetic programming (GP) is used to obtain interpretable policies on control tasks involving continuous action spaces through model-based policy learning. However, the interpretability was not captured in the design of the fitness function, and a large archive was created passively to store every policy of each complexity encountered during the evolutionary search. A linear decision tree (DT) based model is used in (Liu et al. 2018) to approximate the Q-values of a trained neural network. In that work, the split in the DT occurs based on only one feature, and at each terminal node the Q-function is fitted using a linear model on all features. (Verma et al. 2018) uses a program sketch S to define the domain of interpretable policies e. Interpretable policies are found using a trained black-box oracle e_N as a reference, by first conducting a local search in the sketch space S to mimic the behaviour of the oracle e_N and then fine-tuning the policy parameters through online Bayesian optimization. The bias towards generating interpretable programs comes from controlled initialization and local search rather than from explicitly capturing interpretability as one of the fitness measures. Particle swarm optimization (Kennedy and Eberhart 1995) is used to generate an interpretable fuzzy rule set in (Hein et al. 2017) and is demonstrated on classic control problems involving continuous actions.
Work on DT-based policies (Breiman 2017) through imitation learning has been carried out in (Ross, Gordon, and Bagnell 2011). (Bastani, Pu, and Solar-Lezama 2018) extends this to utilize Q-values and eventually render DT policies involving relatively few nodes on some toy games and the CartPole environment, with the ultimate aim of making the induced policies verifiable. (Bastani, Kim, and Bastani 2017) used axis-aligned DTs to develop interpretable models for black-box classifiers and RL policies. They first derive a distribution function P by fitting the training data through axis-aligned Gaussian distributions; P is then used to compute the loss function for splitting the data in the DT. (Vandewiele et al. 2016) attempts to generate interpretable DTs from an ensemble using a genetic algorithm. In (Ernst, Geurts, and Wehenkel 2005), regression trees are derived using classical methods such as CART (Breiman 2017) and Kd-trees (Bentley 1975) to model the Q-function through supervised training on batches of experiences, and a comparative study is made with ensemble techniques. In (Silva et al. 2020), a gradient-based approach is developed to train a DT of pre-fixed topology involving linear split rules. These rules are later simplified to allow only one feature per split node, and the resulting DTs are pruned to generate a simplified rule set.

While the above methods attempt to generate an interpretable policy, their search processes do not use the complexity of the policy in the objective function; instead, they rely on initializing the search with certain interpretable policies. In our approach described below, we build an efficient search algorithm to directly find interpretable policies using recent advances in nonlinear optimization.

4 Nonlinear Decision Tree (NLDT) Approach

In this study, we use a direct mathematical rule generation approach (presented in Eq. 2) based on a nonlinear decision tree (NLDT) approach (Dhebar and Deb 2020), which we briefly describe here. The intention is to model the interpretable policy π_int so as to approximate and explain the pre-trained black-box policy π_oracle using the labelled state-action data generated with π_oracle. Decision trees are a popular choice due to their interpretability: they are intuitive, and each decision can be easily interpreted. However, in a general scenario, regular decision trees end up with a complicated topology, since the rule at each conditional node can assume only an axis-parallel structure x_i ≤ τ to make a split. At the other extreme, single-rule classifiers like support vector machines (SVMs) have just one rule, but it is complicated and highly nonlinear. Keeping these two extremes in mind, we develop a nonlinear decision tree framework where each conditional node can assume a nonlinear functional form, while the tree is allowed to grow by recursively splitting the data at conditional nodes, similar to the procedure used to induce regular decision trees. In our case of replicating a policy π_oracle, a conditional node captures a nonlinear control logic and the terminal leaf nodes indicate the action. This is shown schematically in Figure 2. In the binary-split NLDT used in this study, a conditional node is allowed to have exactly two splits, as shown in Figure 2.
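To make this structure concrete, the following is a minimal Python sketch (an illustration, not the authors' implementation) of a binary-split NLDT: each conditional node stores a nonlinear rule f(x) and routes a state to its left child when f(x) ≤ 0, while leaf nodes store the action. The two example rules are hypothetical.

```python
# Minimal sketch of a binary-split NLDT: conditional nodes hold a nonlinear
# rule f(x); leaves hold a discrete action. A state goes left when f(x) <= 0.
from dataclasses import dataclass
from typing import Callable, Optional
import numpy as np

@dataclass
class Node:
    action: Optional[int] = None                           # set only at leaves
    rule: Optional[Callable[[np.ndarray], float]] = None   # f(x) at conditional nodes
    left: Optional["Node"] = None                          # taken when f(x) <= 0
    right: Optional["Node"] = None                         # taken when f(x) > 0

def predict(node: Node, x: np.ndarray) -> int:
    """Route a normalized state vector down the tree to a leaf action."""
    while node.action is None:
        node = node.left if node.rule(x) <= 0.0 else node.right
    return node.action

# Illustrative two-rule tree with hypothetical coefficients (cf. Sec. C.3 of
# the Supplementary document for the shape of such rules):
tree = Node(
    rule=lambda x: abs(0.9 - 0.5 / x[0] + 0.3 / x[1] - 0.4 * x[0] * x[1]) - 0.2,
    left=Node(                                             # root rule satisfied
        rule=lambda x: abs(1.1 - 0.6 * x[0] - 0.3 * x[1]) - 0.3,
        left=Node(action=2),
        right=Node(action=1),
    ),
    right=Node(action=0),
)
action = predict(tree, np.array([1.4, 1.7]))
```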
[Figure 2: Schematic of a binary-split NLDT.]

The nonlinear split rule f(x) at each conditional node is expressed as a weighted sum of power laws:

  f(x) = Σ_{i=1}^p w_i B_i + θ_1, if m = 0,
  f(x) = |Σ_{i=1}^p w_i B_i + θ_1| − |θ_2|, if m = 1,   (3)

where the power laws B_i are given as B_i = Π_{j=1}^d x_j^{b_ij}, and m indicates whether an absolute-value operator is present in the rule. In Section 5.1, we discuss procedures to derive the values of the exponents b_ij, weights w_i, and biases θ_i.

5 Overall Approach

The overall approach is illustrated in Figure 3. First, a dedicated black-box policy π_oracle is trained from the actual environment/physics of the problem; this aspect is not the focus of this paper. Next, the trained policy π_oracle (Block 1 in the figure) is used to generate labelled training and testing datasets of state-action pairs from different time steps. We generate two types of training datasets: Regular – as they are recorded from multiple episodes, and
Balanced – selected from multiple episodes so as to have an almost equal number of states for each action, where an episode is a complete simulation of controlling an object with a policy over multiple time steps. Third, the labelled training dataset (Block 2) is used to find the NLDT (Block 3) using the recursive bilevel evolutionary algorithm described in Section 5.1. We call this an open-loop NLDT (or NLDT_OL), since it is derived from a labelled state-action dataset generated from π_oracle, without using any overall reward or final-goal objective in its search process, as is typically the case in reinforcement learning.

[Figure 3: A schematic of the proposed overall approach.]

The use of labelled state-action data in a supervised manner allows a faster search for the NLDT even with a large dataset, compared to constructing the NLDT from scratch through reinforcement learning by interacting with the environment to maximize the cumulative rewards (Verma et al. 2018). Next, in an effort to make the overall NLDT interpretable while simultaneously ensuring better closed-loop performance, we prune the NLDT by taking only the top part of NLDT_OL (called NLDT(P)_OL in Block 4) and re-optimize all nonlinear rules within it for the weights and biases, using an efficient evolutionary optimization procedure, to obtain the final NLDT* (Block 5). The re-optimization is done here with closed-loop objectives, such as the cumulative reward or the closed-loop completion rate. We briefly discuss the open-loop training procedure for inducing NLDT_OL and the closed-loop training procedure for generating NLDT* in the next sections.

5.1 Open-Loop Training

A labelled state-action dataset is first created using the pre-trained black-box policy π_oracle. Since we are dealing with discrete-action control problems, the underlying imitation task of replicating the behavior of π_oracle using the labelled state-action data translates to a classification problem. We train the NLDT discussed in Section 4 to fit the state-action data through supervised learning. The nonlinear split rule f(x) at each conditional node (Figure 2 and Eq. 3) is derived using a dedicated bilevel optimization algorithm, where the upper level searches for the template of the nonlinear rule and the corresponding lower level estimates optimal values of the weights/coefficients for an optimal split of the data present in the conditional node. The optimization formulation for deriving a nonlinear split rule f(x) (Eq. 3) at a given conditional node is given below:

  Minimize F_U(B, m, w*, θ*),
  subject to (w*, θ*) ∈ argmin { F_L(w, θ)|_(B,m) : F_L(w, θ)|_(B,m) ≤ τ_I, −1 ≤ w_i ≤ 1 ∀ i, θ ∈ [−1, 1]^(m+1) },
  m ∈ {0, 1}, b_ij ∈ Z,   (4)

where Z is the set of exponents allowed, in order to limit the complexity of the derived rule structure. In this study, we use Z = {−3, −2, −1, 0, 1, 2, 3}. The objective F_U quantifies the complexity of the nonlinear rule by enumerating the number of terms present in the equation of the rule f(x), as shown below:

  F_U(B, m, w*, θ*) = Σ_{i=1}^p Σ_{j=1}^d g(b_ij),   (5)

where g(α) = 1 if α ≠ 0, and zero otherwise. Here m indicates the presence or absence of a modulus operator, and w and θ encode the rule weights w_i and biases θ_i, respectively. The lower-level objective function F_L quantifies the net impurity of the child nodes resulting from the split.
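As a concrete illustration, the following Python sketch evaluates the split rule of Eq. 3, the upper-level complexity objective F_U of Eq. 5, and the Gini-based lower-level objective F_L whose exact definition (Eq. 6) is given next; it is a simplified stand-in for the actual bilevel evolutionary implementation.

```python
# Sketch of the per-node quantities: the power-law split rule (Eq. 3), the
# upper-level rule-complexity objective F_U (Eq. 5), and the Gini-weighted
# lower-level split-quality objective F_L (Eq. 6, defined in the text below).
import numpy as np

def split_rule(x, B, w, theta, m):
    """Eq. (3). B is a (p, d) integer matrix of exponents b_ij drawn from Z;
    w holds the p weights; theta holds the bias term(s)."""
    powers = np.prod(x[None, :] ** B, axis=1)      # B_i = prod_j x_j^{b_ij}
    value = np.dot(w, powers) + theta[0]
    return value if m == 0 else abs(value) - abs(theta[1])

def F_U(B):
    """Eq. (5): number of variable appearances (nonzero exponents) in the rule."""
    return int(np.count_nonzero(B))

def gini(y):
    """Gini impurity of the class (action) labels in a node."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def F_L(X, y, B, w, theta, m):
    """Eq. (6): size-weighted impurity of the two child nodes after the split."""
    f = np.array([split_rule(x, B, w, theta, m) for x in X])
    left, right = y[f <= 0], y[f > 0]
    n = len(y)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```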
The impurity I of a node P is computed using the Gini-score: Gini(P) = 1 − Σ_{i=1}^c (N_i/N)², where N is the total number of points present in the node and N_i represents the number of points belonging to class i. The datapoints present in node P get distributed into two non-overlapping subsets based on their split-function value: datapoints with f(x) ≤ 0 go to the left child node L, and the rest go to the right child node R. The lower-level objective function F_L, which quantifies the quality of this split, is then given by

  F_L(w, θ)|_(B,m) = (N_L/N_P) Gini(L) + (N_R/N_P) Gini(R),   (6)

evaluated at the given (w, θ, B, m). The parameter τ_I in Eq. 4 represents the maximum allowable net impurity (Eq. 6) of the child nodes. The child nodes obtained after a split undergo further splits, and the process continues until one of the termination criteria is met.

We use a bilevel optimization algorithm (Sinha, Malo, and Deb 2018) to derive the split rule f_i(x) at the i-th conditional node of the NLDT. The upper level of the optimization navigates through the domain of discrete exponents b_ij to prescribe the structure of the rule. The lower-level optimization then finds optimal values of the weights w_i and biases θ_i for that rule structure, making the overall NLDT search efficient. After the entire NLDT is found, a pruning and tree-simplification strategy (see the Supplementary Document for more details) is applied to reduce the size of the NLDT in an effort to improve the interpretability of the overall rule sets. This entire process of inducing an NLDT from the labelled state-action data results in the open-loop interpretable tree, NLDT_OL. NLDT_OL can then be used to explain the behavior of the oracle policy π_oracle. We will see in Section 6 that, despite not being 100% accurate in imitating π_oracle, NLDT_OL manages to achieve respectable closed-loop performance, with a 100% completion rate and a high cumulative reward value. Next, we discuss the closed-loop training procedure used to obtain NLDT*.

5.2 Closed-Loop Training

The intention behind the closed-loop training is to enhance the closed-loop performance of the interpretable NLDT. It will be shown in Section 6 that while the closed-loop performance of NLDT_OL is at par with π_oracle on control tasks involving two to three discrete actions, like CartPole and MountainCar, NLDT_OL struggles to autonomously control the agent on problems such as LunarLander, which have more states and actions. In closed-loop training, we fine-tune and re-optimize the weights W and biases Θ of an entire NLDT_OL (or a pruned NLDT_OL, i.e. NLDT(P)_OL, Block 4 in Figure 3) to maximize its closed-loop fitness F_CL, which is expressed as the average of the cumulative reward collected over M episodes:

  Maximize F_CL(W, Θ) = (1/M) Σ_{i=1}^M R_e^(i)(W, Θ),
  subject to W ∈ [−1, 1]^(n_w), Θ ∈ [−1, 1]^(n_θ),   (7)

where R_e^(i) is the cumulative reward of the i-th episode, n_w and n_θ are the total numbers of weights and biases appearing in the entire NLDT, and M = 20 in our case.

6 Results

In this section, we present results obtained using our approach on four control tasks: (i) CartPole, (ii) CarFollowing, (iii) MountainCar, and (iv) LunarLander. The first two problems have two discrete actions, the third has three discrete actions, and the fourth has four discrete actions. The open-loop statistics are reported using the training and testing accuracy scores on labelled state-action data generated from π_oracle.
For quantifying the closed-loop performance, we use two metrics: (i) Completion Rate, which measures the fraction of episodes that are successfully completed, and (ii) Cumulative Reward, which quantifies how well an episode is executed. For each problem, 10 runs of open-loop training are executed using 10,000 training datapoints. Open-loop statistics obtained from these 10 independent runs, each using 10,000 training and 10,000 test datapoints, are reported. We choose the median-performing NLDT_OL for the closed-loop analysis. We run 50 batches of 100 episodes each and report statistics of the completion rate and cumulative reward.

6.1 CartPole Problem

This problem comprises four state variables and is controlled using two actions, move left and move right, with the objective of stabilizing an inverted pendulum on a cart (see the Supplementary Document for more details). We conduct an ablation study to show the effect of training-data size on the open-loop and closed-loop performance of NLDT_OL. The results of this study are shown in Table 1. It is observed that an NLDT_OL trained with at least 5,000 data points shows robust open-loop performance. The obtained NLDT_OL has about two rules, with three terms per rule on average, in the derived policy function. Interestingly, the same NLDT (without closed-loop training) also produces the best possible closed-loop performance by achieving the maximum cumulative reward value of 200.

6.2 CarFollowing Problem

We have developed a discretized version of the car-following problem discussed in (Nageshrao, Costa, and Filev 2019), wherein the task is to follow the car in front, which moves with a random acceleration profile (between −1 m/s² and +1 m/s²), and maintain a safe distance of d_safe = 30 m from it. The rear car is controlled using two discrete acceleration values of +1 m/s² (action 0) and −1 m/s² (action 1). The car-chase episode terminates when the relative distance d_rel = x_front − x_rear is either zero (i.e., a collision) or greater than 150 m. At the start of the simulation, both cars have an initial velocity of zero. A DNN policy for the CarFollowing problem was obtained using a double Q-learning algorithm (Van Hasselt, Guez, and Silver 2015). The reward function for the CarFollowing problem is shown in the Supplementary document, indicating that a relative distance close to 30 m produces the highest reward. It is to be noted here that, unlike the CartPole control problem, where the dynamics of the system is deterministic, the dynamics of the CarFollowing problem is not deterministic due to the random acceleration profile with which the car in front moves. This randomness, introduced by the unpredictable behaviour of the front car, makes this problem more challenging.

Results for the CarFollowing problem are shown in Table 2. An average open-loop accuracy of 96.53% is achieved with at most three rules, each having 3.28 terms on average. For this problem, we apply the closed-loop re-optimization (Blocks 4 and 5, producing Block 6 in Figure 3) to the entire NLDT_OL. As shown in Table 3, NLDT* is able to achieve better closed-loop performance (100% completion rate and a slightly better average cumulative reward). Figure 4 shows that NLDT* adheres to the 30 m gap between the cars more closely than the original DNN or NLDT_OL.

Results of the NLDT's performance on problems with two discrete actions (Tables 1, 2 and 3) indicate that, despite having a noticeable mismatch with the open-loop output of the oracle black-box policy π_oracle, the closed-loop performance of the NLDT is at par with, or at times better than, that of π_oracle.
This observation suggests that certain state-action pairs are not of crucial importance when it comes to executing the closed-loop control and, therefore, errors made in predicting these state-action events do not deteriorate the closed-loop performance.

[Table 1: Effect of training data size on the performance of NLDT_OL for the CartPole problem. Columns include training data size, training accuracy, and testing accuracy (open-loop).]

[Table 2: Results on the CarFollowing problem. Columns include training accuracy, testing accuracy, and tree depth.]

[Table 3: Closed-loop performance analysis after re-optimizing the NLDT for the CarFollowing problem (k = 10³). For each of DNN, NLDT_OL, and NLDT*, the best and average (± std.) cumulative reward and the completion rate are reported; the DNN achieves a best cumulative reward of 174.16k.]
6.3 MountainCar Problem

This problem comprises two state variables, capturing the x-position and velocity of the car. The task is to use three actions to drive the under-powered car to the destination (see the Supplementary Document for more details).

A compilation of results for the NLDT_OL induced using training datasets with different data distributions (regular and balanced) is presented in Table 4. A state-action plot obtained using π_oracle and one of the NLDT policies corresponding to the first row of Table 4 is provided in Figures 1a and 1b, respectively. It is observed that the mismatch in the open-loop performance (i.e., testing accuracy in Table 4) comes mostly from the lower-left region of the state-action plot (Figures 1a and 1b), due to the highly nonlinear nature of π_oracle. Despite this mismatch, our interpretable NLDT policy is able to achieve close to 100% closed-loop control performance with an average of 2.4 rules having 2.97 terms. Also, the NLDT trained on the balanced dataset (second row of Table 4) is able to achieve 100% closed-loop performance and involves about three control rules with an average of 1.67 terms in each rule.

[Table 4: Results on the MountainCar problem. Rows correspond to regular and balanced training data; columns include training accuracy, testing accuracy, and tree depth.]

[Figure 4: Relative distance plot for the CarFollowing problem, comparing DNN, NLDT_OL, and NLDT* against the safe distance d_safe = 30 m.]
6.4 LunarLander Problem

The task in this problem is to control the lunar lander using four discrete actions and successfully land it on the lunar terrain. The state of the lunar lander is expressed with eight state variables, of which six are continuous and two are categorical. More details on this problem are provided in the Supplementary document.

[Table 5: NLDT_OL with depths 3 and 6 for the LunarLander problem, for regular and balanced training data; columns include training accuracy and testing accuracy.]

Table 5 provides a compilation of the results obtained using NLDT_OL. In this problem, while better open-loop performance occurs for the regular dataset, better closed-loop performance is observed when the NLDT open-loop training is done on the balanced dataset. Also, NLDT_OL trees of depth three are not adequate to achieve high closed-loop performance. The best performance is observed using the balanced dataset, where NLDT_OL achieves a 93% episode completion rate. A specific NLDT_OL with 26 rules, each having about 4.15 terms, is shown in the Supplementary Document.

[Table 6: Closed-loop performance on the LunarLander problem with and without re-optimization of the 26-rule NLDT_OL. The number of rules is given in brackets for each NLDT: NLDT-2 (2), NLDT-3 (4), NLDT-4 (7), NLDT-5 (13), NLDT-6 (26); the DNN has 4,996 parameters. Cumulative reward and completion rate are reported before and after re-optimization.]
It is understandable that a complex control task involving many state variables cannot be simplified or made interpretable with just one or two control rules. Next, we use a part of the NLDT_OL from the root node to obtain the pruned NLDT(P)_OL (step 'B' in Figure 3) and re-optimize all weights (W) and biases (Θ) using the procedure discussed in Section 5.2 (shown by the orange box in Figure 3) to find the closed-loop NLDT*. Table 6 shows that for the pruned NLDT-3, which comprises the top three layers and involves only four rules of the original 26-rule NLDT_OL (i.e., NLDT-6), the closed-loop performance increases from 51% to 96% (NLDT*-3 results in Table 6) after re-optimizing its weights and biases with closed-loop training. The resulting NLDT with its associated four rules is shown in Figure 5.

[Figure 5: Final NLDT*-3 for the LunarLander problem; x̂_i is a normalized state variable (see Supplementary Document).]

As shown in Table 6, the NLDT* with just two rules (NLDT-2) is too simplistic and does not recover well after re-optimization. However, the NLDT*s with four and seven rules achieve near-100% closed-loop performance. Clearly, NLDT*s with more rules (NLDT-5 and NLDT-6) are not worth considering, since both their closed-loop performance and the size of their rule sets are worse than those of NLDT*-4. Note that the DNN produces a better reward but not a sufficient completion rate, and the policy is far more complex, with 4,996 parameters.

7 Conclusions and Future Studies

In this paper, we have proposed a two-step strategy to arrive at hierarchical and interpretable rule sets using a nonlinear decision tree (NLDT) concept, to facilitate an explanation of the working principles of AI-based policies. The NLDT training phases use recent advances in nonlinear optimization to focus the search on the rule structure and on the weights and biases of the rules, using a bilevel optimization algorithm. Starting with open-loop training, which is relatively fast but uses only time-instant state-action data, we have proposed a final closed-loop training phase in which the complete open-loop NLDT, or a part of it, is re-optimized for weights and biases using complete episode data. Results on four popular discrete-action problems have amply demonstrated the usefulness of the proposed overall approach.

This proof-of-principle study encourages us to pursue a number of further studies. First, the scalability of the interpretable NLDT approach to large-dimensional state-action-space problems must now be explored. A previous study on NLDT (Dhebar and Deb 2020) for binary classification of dominated versus non-dominated data in multi-objective problems was successfully extended to 500-variable problems. While this is encouraging, the use of customization methods for initialization and genetic operators, using problem heuristics and/or the recently proposed innovization methods (Deb and Srinivasan 2006), in the upper-level problem can be tried. Second, this study has used a computationally fast open-loop accuracy measure as the fitness for the evolution of NLDT_OL. This is because, in general, an NLDT_OL with high open-loop accuracy is likely to achieve high closed-loop performance. However, we have observed here that high closed-loop performance is achievable with an NLDT_OL having somewhat degraded open-loop performance, once it is re-optimized using closed-loop performance metrics. Thus, a method to identify the crucial (open-loop) states from the AI-based controller dataset that improve the closed-loop performance would be another interesting step for deriving NLDT_OL.
This may eliminate the need for re-optimization through closed-loop training. Third, a more comprehensive study, using closed-loop performance and the respective complexity as two conflicting objectives of a bi-objective NLDT search, would produce multiple trade-off control rule sets. Such a study can not only make the whole search process faster due to the expected similarities among multiple policies, but will also enable users to choose a single policy solution from a set of accuracy-complexity trade-off solutions.

References

Bastani, O.; Kim, C.; and Bastani, H. 2017. Interpretability via model extraction. arXiv preprint arXiv:1706.09773.
Bastani, O.; Pu, Y.; and Solar-Lezama, A. 2018. Verifiable reinforcement learning via policy extraction. In Advances in Neural Information Processing Systems (NIPS), 2494–2504.
Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9): 509–517.
Breiman, L. 2017. Classification and Regression Trees. Routledge.
Deb, K. 2005. Multi-objective Optimization Using Evolutionary Algorithms. Wiley.
Deb, K.; and Agrawal, R. B. 1995. Simulated binary crossover for continuous search space. Complex Systems 9(2): 115–148.
Deb, K.; Sindhya, K.; and Okabe, T. 2007. Self-adaptive simulated binary crossover for real-parameter optimization. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, 1187–1194.
Deb, K.; and Srinivasan, A. 2006. Innovization: Innovating design principles through optimization. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, 1629–1636. ACM.
Dhebar, Y.; and Deb, K. 2020. Interpretable rule discovery through bilevel optimization of split-rules of nonlinear decision trees for classification problems. arXiv preprint arXiv:2008.00410.
Ernst, D.; Geurts, P.; and Wehenkel, L. 2005. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6: 503–556.
Goldberg, D. E.; and Deb, K. 1991. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of Genetic Algorithms, volume 1, 69–93. Elsevier.
Hein, D.; Hentschel, A.; Runkler, T.; and Udluft, S. 2017. Particle swarm optimization for generating interpretable fuzzy reinforcement learning policies. Engineering Applications of Artificial Intelligence 65: 87–98.
Hein, D.; Udluft, S.; and Runkler, T. A. 2018. Interpretable policies for reinforcement learning by genetic programming. Engineering Applications of Artificial Intelligence 76: 158–169.
Kennedy, J.; and Eberhart, R. 1995. Particle swarm optimization. In Proceedings of ICNN'95 – International Conference on Neural Networks, volume 4, 1942–1948. IEEE.
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Liu, G.; Schulte, O.; Zhu, W.; and Li, Q. 2018. Toward interpretable deep reinforcement learning with linear model U-trees. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 414–429. Springer.
Maes, F.; Fonteneau, R.; Wehenkel, L.; and Ernst, D. 2012. Policy search in a space of simple closed-form formulas: towards interpretability of reinforcement learning. In International Conference on Discovery Science, 37–51. Springer.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 1928–1937.
Nageshrao, S.; Costa, B.; and Filev, D. 2019. Interpretable approximation of a deep reinforcement learning agent as a set of if-then rules. In 18th IEEE International Conference on Machine Learning and Applications (ICMLA), 216–221. IEEE.
Noothigattu, R.; Bouneffouf, D.; Mattei, N.; Chandra, R.; Madan, P.; Varshney, K.; Campbell, M.; Singh, M.; and Rossi, F. 2018. Interpretable multi-objective reinforcement learning through policy orchestration. arXiv preprint arXiv:1809.08343.
Peters, J.; Mülling, K.; and Altun, Y. 2010. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), volume 10, 1607–1612. Atlanta.
Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 627–635.
Rummery, G. A.; and Niranjan, M. 1994. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Department of Engineering, University of Cambridge, Cambridge, UK.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning (ICML), 1889–1897.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Silva, A.; Gombolay, M.; Killian, T.; Jimenez, I.; and Son, S.-H. 2020. Optimization methods for interpretable differentiable decision trees applied to reinforcement learning. In International Conference on Artificial Intelligence and Statistics, 1855–1865.
Sinha, A.; Malo, P.; and Deb, K. 2018. A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation 22(2): 276–295.
Smart, W. D.; and Kaelbling, L. P. 2000. Practical reinforcement learning in continuous spaces. In International Conference on Machine Learning (ICML), 903–910.
Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D. S.; Mozer, M. C.; and Hasselmo, M. E., eds., Advances in Neural Information Processing Systems 8, 1038–1044. MIT Press.
Van Hasselt, H.; Guez, A.; and Silver, D. 2015. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461.
Vandewiele, G.; Janssens, O.; Ongenae, F.; De Turck, F.; and Van Hoecke, S. 2016. GENESIM: Genetic extraction of a single, interpretable model. arXiv preprint arXiv:1611.05722.
Verma, A.; Murali, V.; Singh, R.; Kohli, P.; and Chaudhuri, S. 2018. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477.

Supplementary Document

A Additional Information about the Proposed Method
Additional information about the proposed NLDT_OL and NLDT* search procedures is provided here.

A.1 Data Normalization
First, we provide the exact normalization of state variables performed before the open-loop learning task is executed. Before training and inducing the nonlinear decision tree (NLDT), features in the dataset are normalized using the following equation:

  x̂_i = 1 + (x_i − x_i^min) / (x_i^max − x_i^min),   (8)

where x_i is the original value of the i-th feature, x̂_i is its normalized value, and x_i^min and x_i^max are the minimum and maximum values of the i-th feature as observed in the training dataset. This normalization makes every feature x̂_i lie within [1, 2]. It is done to ensure that x̂_i = 0 is avoided, so as not to cause a division-by-zero error.
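In code, this normalization is straightforward; the following sketch (with assumed array inputs) fits the bounds on the training data and applies Eq. 8:

```python
# Eq. (8) as code: map every feature into [1, 2], so that a power-law rule
# that divides by a normalized feature can never divide by zero.
import numpy as np

def fit_bounds(X_train):
    """Per-feature minimum and maximum observed in the training dataset."""
    return X_train.min(axis=0), X_train.max(axis=0)

def normalize(X, x_min, x_max):
    return 1.0 + (X - x_min) / (x_max - x_min)
```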
The overall search procedure described in Figure 3 in themain paper clearly indicated that it is a two-step optimiza-tion procedure. In the first optimization procedure, an open-loop NLDT (NLDT OL ) is evolved using a bilevel optimiza-tion approach applied recursively to derive split-rule f ( x ) at each conditional node. Here, each training datapoint con-sists of a time-instant state-action pair obtained using oraclepolicy π oracle . One of the objective function of the overallbilevel algorithm is the minimization of the weighted Gini-score ( F L , Eq. 6 in main paper), which quantifies the purity of nodes created after the split. This measure can also serveas a proxy to indicate the error between predicted action andthe AI-model action. For a node P , the Gini-score is com-puted as Gini( P ) = 1 − c (cid:88) i =1 (cid:18) N i N (cid:19) , (9)where N is the total number of datapoints in node P and N i is the number of datapoints present in node P which belongsto action - i . As can be seen from Eq. 9, the computation ofGini-score is computationally cheap and fast. This eventu-ally makes the computation of F L (Eq. 6 in main paper) tobe cheap and fast, a feature which is desired for any bilevel-algorithm since for each solution member in the upper-level of the search, a dedicated full run of lower-level optimiza-tion performed and if the lower-level objective function iscomputationally taxing then it will make the overall bilevelalgorithm extremely slow. Additionally, it is to note herethat every rule structure ( f j ( x ) ) starting from the root node( j = 0 ) is optimized independently by using a subset ofthe training data dictated by the completed NLDT thus far.Nowhere in the development of the NLDT OL , any closed-loop evaluation function (such as, a cumulative reward func-tion of completing the task, or success rate of completion)is used in the optimization process. The structure of theNLDT OL and structure of every rule (with its mathematicalstructure and coefficients/biases associated with each rule)are evolved. Due to the vastness of the search space of thisoptimization task, we developed a computationally efficientbilevel optimization procedure composing of a computation-ally cheap and fast lower-level objective. The two levels al-low the structure of each rule and the associated coefficientsand biases to be learnt in a hierarchical manner. This is also9ossible due to recent advances in nonlinear optimizationusing hybrid evolutionary and point-based local search al-gorithms (Dhebar and Deb 2020).On the contrary, the closed-loop optimization restricts itssearch to a fixed NLDT structure (which is either identicalto NLDT OL or a part of it from the root node, as illustratedin Figure A.7), but modifies the coefficients and biases ofall rules simultaneously in order to come up with a betterclosed-loop performance. Here, an entire episode (a seriesof time-instance state-action pairs from start ( t = 0 ) to fin-ish ( t = T )) can be viewed as a single datapoint. As an ob-jective function, the average of cumulative-reward collectedacross 20 episodes, each with a random starting state S is used to make a better evaluation of the resulting NLDT.Due to this aspect, the computational burden is more, butthe search process stays in a single level. We employ an effi-cient real-parameter genetic algorithm with standard param-eter settings (Deb and Agrawal 1995; Deb 2005). 
To make the search more efficient, we include the NLDT_OL (or its part, as the case may be) in the initial population of solutions for the closed-loop search.

The differences between the two optimization tasks are summarized in Table A.1. As discussed, both optimization tasks have their role in the overall process. While the evaluation of a solution in the open-loop optimization is computationally quick, it does not use a whole episode in its evaluation process to indicate how the resulting rule or NLDT performs on the overall task. The goal there is to maximize the state-action match with the true action as prescribed by π_oracle. This task builds a complete NLDT structure from nothing by finding an optimized rule for every conditional node; the use of a bilevel optimization is therefore needed. On the other hand, keeping a part (or the whole) of the NLDT_OL structure fixed, the closed-loop optimization fine-tunes all associated rules to maximize the cumulative reward R_total. A closed-loop optimization alone, on episodic time-instance data to estimate R_total, would not be computationally tractable in complex problems.

B Problems Used in the Study
In this section, we provide a detailed description of the four environments used in this study.
B.1 CartPole Environment
The CartPole problem comprises four state variables: 1) the x-position (x → x_1), 2) the velocity in the positive x direction (v → x_2), 3) the angular position from the vertical (θ → x_3), and 4) the angular velocity (ω → x_4). It is controlled by applying a force towards the left (Action 0) or right (Action 1) of the cart (Figure A.1a). The objective is to balance the inverted pendulum (i.e., −24° ≤ θ ≤ 24°) while also ensuring that the cart does not fall off the platform (i.e., −2.4 ≤ x ≤ 2.4). For every time step, a reward value of 1 is received while θ is within ±24°. The maximum episode length is set to 200 time steps. A deep neural network (DNN) controller is trained on the CartPole environment using the PPO algorithm (Schulman et al. 2017).
[Figure A.1: The other three control problems: (a) CartPole environment; (b) CarFollowing environment; (c) LunarLander environment.]
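A hedged sketch of Blocks 1 and 2 of Figure 3 of the main paper for this environment is given below; the use of the stable-baselines3 library is our assumption (the paper only states that PPO was used), as are the training budget and the classic Gym API.

```python
# Sketch of Blocks 1-2 of Figure 3 for CartPole: train a DNN oracle with PPO
# (stable-baselines3 assumed) and collect a labelled state-action dataset.
import gym
import numpy as np
from stable_baselines3 import PPO

env = gym.make("CartPole-v0")                     # 200-step episodes
oracle = PPO("MlpPolicy", env, verbose=0)
oracle.learn(total_timesteps=100_000)             # training budget: assumed

states, actions = [], []
while len(states) < 10_000:                       # 10,000 points, as in Sec. 6
    obs, done = env.reset(), False
    while not done and len(states) < 10_000:
        act, _ = oracle.predict(obs, deterministic=True)
        states.append(obs.copy())
        actions.append(int(act))
        obs, reward, done, info = env.step(act)

X, y = np.array(states), np.array(actions)        # labelled open-loop dataset
```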
B.2 CarFollowing Environment
As mentioned in the main paper, we have developed a discretized version of the car-following problem discussed in (Nageshrao, Costa, and Filev 2019), illustrated in Figure A.1b, wherein the task is to follow the car in front, which moves with a random acceleration profile (between −1 m/s² and +1 m/s²), and maintain a safe distance of d_safe = 30 m from it. The rear car is controlled using two discrete acceleration values of +1 m/s² (Action 0) and −1 m/s² (Action 1). The car-chase episode terminates when the relative distance d_rel = x_front − x_rear is either zero (i.e., a collision) or greater than 150 m. At the start of the simulation, both cars have an initial velocity of zero. A DNN policy for the CarFollowing problem was obtained using a double Q-learning algorithm (Van Hasselt, Guez, and Silver 2015). The reward function for the CarFollowing problem is shown in Figure A.2, indicating that a relative distance close to 30 m produces the highest reward.

[Figure A.2: Reward function for the CarFollowing environment.]

It is to be noted here that, unlike the CartPole control problem, where the dynamics of the system is deterministic, the dynamics of the CarFollowing problem is not deterministic due to the random acceleration profile with which the car in front moves. This randomness, introduced by the unpredictable behaviour of the front car, makes this problem more challenging.

Table A.1: Differences between open-loop and closed-loop optimization problems.
- Goal: Open-loop – find each rule structure f_j(x), one at a time, starting from the root node (j = 0); Closed-loop – find the overall NLDT simultaneously.
- Variables: Open-loop – the nonlinear structure B_ij of the i-th term of every j-th rule, coefficients w_ij, and biases θ_j; Closed-loop – coefficients w_ij and biases θ_j of all rules j in the NLDT.
- Each training datapoint: Open-loop – a state-action pair (x_t, a_t) for each time instance t; Closed-loop – M randomly initialized episodes comprising state-action-reward triplets (x_t, a_t, r_t, for t = 1, ..., T) for each simulation.
- Objective function: Open-loop – weighted Gini-score (mismatch in actions); Closed-loop – average cumulative reward value.
- Optimization method: Open-loop – bilevel optimization (upper level by a customized evolutionary algorithm, lower level by regression); Closed-loop – single-level genetic algorithm.
- Termination condition: Open-loop – upper level: change in fitness below a threshold for 5 consecutive generations, with a maximum of 100 generations; lower level: change in fitness below a threshold for 5 consecutive generations, with a maximum of 50 generations; Closed-loop – 30 generations.
- Outcome: Open-loop – NLDT_OL; Closed-loop – NLDT*.
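The following minimal sketch reproduces the discretized CarFollowing dynamics described in Section B.2; the 1 s integration step and the initial gap of d_safe are assumptions, as the exact simulation settings are not given here.

```python
# Sketch of the discretized CarFollowing episode: the front car draws a random
# acceleration in [-1, 1] m/s^2, the rear car applies +1 m/s^2 (Action 0) or
# -1 m/s^2 (Action 1); the episode ends on collision or when d_rel > 150 m.
import numpy as np

def run_episode(policy, d0=30.0, dt=1.0, max_steps=1000, rng=np.random):
    x_front, v_front = d0, 0.0          # both cars start at rest
    x_rear, v_rear = 0.0, 0.0
    a_prev = -1.0                       # previous rear-car acceleration
    for _ in range(max_steps):
        d_rel = x_front - x_rear
        if d_rel <= 0.0 or d_rel > 150.0:
            return False                # collision, or front car got away
        state = (d_rel, v_front - v_rear, a_prev)
        a_rear = 1.0 if policy(state) == 0 else -1.0
        a_front = rng.uniform(-1.0, 1.0)
        v_front += a_front * dt
        v_rear += a_rear * dt
        x_front += v_front * dt
        x_rear += v_rear * dt
        a_prev = a_rear
    return True                         # rear car kept following successfully
```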
B.3 MountainCar Environment
A car starts somewhere near the bottom of a valley, and the goal of the task is to reach the flag post located on the right uphill side with a non-negative velocity (Figure A.3). The fuel is not enough to directly climb the hill, and hence a control strategy needs to be devised that first moves the car back (up the left hill) to build potential energy and then accelerates it to eventually reach the flag post within 200 time steps. The car receives a reward value of −1 for each time step until it reaches the flag post, where the reward value is zero. The car is controlled using three actions, accelerate left (Action 0), do nothing (Action 1), and accelerate right (Action 2), by observing its state, which is given by two state variables: the x-position (x → x_1) and the velocity (v → x_2). We use the SARSA algorithm (Rummery and Niranjan 1994) with tile encoding to derive the black-box AI controller, which is represented in the form of a tensor.

[Figure A.3: MountainCar environment.]
This problem is motivated by the classic problem of designing a rocket controller. Here, the state of the lunar lander is expressed with eight state variables, of which six can assume continuous real values, while the remaining two are categorical and can assume a Boolean value (Figure A.1c). The first six state variables indicate the (x, y) position, the velocity, and the angular orientation and angular velocity of the lunar lander. The two Boolean state variables indicate the left-leg and right-leg contact of the lunar lander with the ground terrain. The lunar lander is controlled using four actions: Action 0 → do nothing, Action 1 → fire left engine, Action 2 → fire main engine, and Action 3 → fire right engine. The black-box DNN-based controller for this problem is trained using the PPO algorithm (Schulman et al. 2017) and involves two hidden layers of 64 nodes.
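For reference, an MLP with two hidden layers of 64 units and an input of eight state variables has exactly 4,996 parameters, matching the DNN size reported in Table 6 of the main paper; the PyTorch rendering below, and the choice of activation and deterministic argmax head, are our assumptions.

```python
# Sketch of the oracle policy network: an MLP with two hidden layers of 64
# units mapping 8 state variables to 4 action logits. Parameter count:
# 8*64+64 + 64*64+64 + 64*4+4 = 4,996. Framework and activation are assumed.
import torch
import torch.nn as nn

policy_net = nn.Sequential(
    nn.Linear(8, 64),    # 6 continuous + 2 Boolean state variables
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 4),    # logits over the 4 discrete actions
)

def act(state):
    with torch.no_grad():
        logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
    return int(logits.argmax())     # deterministic oracle action
```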
C Additional Results

Here, we present additional results and one of the final NLDT*s obtained by our overall approach. The parameter settings used to train the NLDTs (and the other black-box AI agents) are provided in Section E.
C.1 CartPole Problem
The NLDT_OL obtained for the CartPole environment is shown in Figure A.4 in terms of the normalized state-variable vector x̂.

[Figure A.4: CartPole NLDT_OL induced using 10,000 training samples. It is 91.45% accurate on the training dataset but has 100% closed-loop performance. Normalization constants: x_min = [−0.91, −0.43, −0.05, −0.40], x_max = [1.37, 0.88, 0.10, 0.45].]

The respective policy is stated as follows:

  if |−0.… x̂_3 x̂_4^(−1) − 0.… x̂_4^(−1) + 0.…| − 0.… ≤ 0, then Action = 0, else
  Action = 1.

A little manipulation reveals that, for a correct control strategy, Action 0 must be invoked if the following condition is true:

  2.39 ≤ (x̂_3/x̂_4 + 3.…/x̂_4) ≤ …,

otherwise Action 1 must be invoked. First, notice that the above policy does not require the current velocity (x̂_2) to determine the left or right action. Second, for small values of the angular position (x_3) and angular velocity (x_4), meaning that the pole is falling towards the left, the above condition is always true; that is, the cart should be pushed towards the left, thereby trying to stabilize the pole to the vertical position. On the other hand, if the pole is falling towards the right (large values of x_3 and x_4), the term in brackets will be smaller than 2.39 for all x̂_4 ∈ [1, 2], and the above policy suggests that Action 1 (push the cart towards the right) must be invoked. When the pole is falling right, a push of the cart towards the right helps to stabilize the pole towards its vertical position. These extreme-case analyses are intuitive, and our policy can be explained for its proper working; but what our NLDT approach is able to find is a precise rule, for all situations of the state variables, that controls the CartPole to a stable configuration, derived mainly from the AI black-box data.

C.2 CarFollowing Problem
The NLDT_OL obtained for the CarFollowing problem is shown in Figure A.5.

[Figure A.5: NLDT_OL for the CarFollowing problem. Normalization constants: x_min = [0.25, −7.93, −1.00], x_max = [30.30, 0.70, 1.00].]

The rule set is provided in its natural if-then-else form below. Recall that the physical meaning of the state variables is: x_1 → d_rel (relative distance between the front and rear cars), x_2 → v_rel (relative velocity between the front and rear cars), and x_3 → a (acceleration value, −1 or +1 m/s², of the previous time step). Action = 1 stands for acceleration and Action = 0 denotes deceleration of the rear car in the next time step.

  if 0.… x̂_1 − 0.… x̂_2 − x̂_3 − 0.… ≤ 0 then
    if 0.… x̂_2^(−1) − 0.… x̂_3 + 1.… ≤ 0 then Action = 1
    else Action = 0
  else Action = 1

From the first rule (Node 0), it is clear that if the rear car is close to the front car (x̂_1 ≈ 1), the root function f(x) is never going to be positive for any relative velocity or previous acceleration of the rear car (both x̂_2 and x̂_3 lying in [1, 2]). Thus, Node 4 (Action = 1, indicating acceleration of the rear car in the next time step) will never be invoked when the rear car is too close to the front car; for x̂_1 ≈ 1, the control always passes to Node 1. A little analysis also reveals that, for x̂_1 ≈ 1, the rule f_1(x) > 0 for any relative velocity x̂_2 ∈ [1, 2]. This means that when the two cars are relatively close, only Node 3 gets fired, decelerating (Action = 0) the rear car. This policy is intuitively correct, as the only way to increase the gap between the cars is for the controlled rear car to decelerate.

However, when the rear car is far away, maintaining a distance of about 30.30 m (the maximum d_rel observed in training), for which x̂_1 ≈ 2, Action 1 (Node 4) gets fired if x̂_2 > 1.… √x̂_3. If the rear car was decelerating in the previous time step (meaning x̂_3 = 1), the obtained NLDT* recommends that the rear car should accelerate if x̂_2 ∈ [1.…, 2], i.e., when the magnitude of the relative velocity is small, or when x_2 ∈ [−0.…, 0.70] m/s. This will help maintain the requisite distance between the cars. On the other hand, if the rear car was already accelerating in the previous time step (x̂_3 = 2), Node 4 does not fire, as x̂_2 can never be more than 1.… √2, and the control goes to Node 1 for another check. Thus, the rule at Node 0 strikes a fine balance in the rear car's movement to keep it a safe distance away from the front car, based on the relative velocity, position, and previous acceleration status. When the control comes to Node 1, Action 1 (acceleration) is invoked if x̂_2 ≥ 1.…/(0.… x̂_3 − 0.…). For x̂_3 ≈ 2, this happens when x̂_2 > 1.…, meaning that when the magnitude of the relative velocity is small (x_2 ∈ [−0.…, 0.70] m/s), the rear car should accelerate in the next time step. For all other negative and large relative velocities (x_2 ∈ [−7.93, −0.…] m/s), meaning the rear car is rushing to catch up with the front car, the rear car should decelerate in the next time step. From the AI black-box data, our proposed methodology is able to create a simple decision tree with two nonlinear rules that balances the movement of the rear car precisely, while also allowing us to understand the behavior of a balanced control strategy.

C.3 MountainCar Problem
The NLDT_OL obtained for the MountainCar problem is shown below in Figure A.6.

Figure A.6: NLDT_OL for the MountainCar problem. Normalization constants: x_min = [-1.20, -0.06], x_max = [0.50, 0.06].

The respective rules are stated as if-then-else statements:

if $\left| -0.\cdots\,\hat{x}_0\hat{x}_1 + 0.\cdots\,\hat{x}^{-\cdots} - 0.\cdots\,\hat{x}^{-\cdots} + 0.\cdots \right| - 0.\cdots \le 0$ then
    if $\left| -0.\cdots\,\hat{x}_0 - 0.\cdots\,\hat{x}_1 + 1.\cdots \right| - 0.\cdots \le 0$ then Action = 2
    else Action = 1
else Action = 0

This rule-set corresponds to the plot shown in Figure 1b of the main paper. A detailed analysis of the two rules can be made to have a deeper understanding of the control policy.
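All of the split rules quoted in this document share a common template. As a reading aid, the generic node rule of the NLDT approach can be written as follows; this is our paraphrase of the formulation in (Dhebar and Deb 2020), not a new definition:

\[
f(\hat{\mathbf{x}}) =
\begin{cases}
\displaystyle\sum_{i=1}^{p} w_i B_i + \theta_1, & \text{without the modulus option},\\[6pt]
\displaystyle\Big|\sum_{i=1}^{p} w_i B_i + \theta_1\Big| + \theta_2, & \text{with the modulus option},
\end{cases}
\qquad
B_i = \prod_{j=1}^{d} \hat{x}_j^{\,b_{ij}},
\]

where each $B_i$ is a power law of the normalized state variables with integer exponents $b_{ij}$, and the weights $w_i$ and biases $\theta_1, \theta_2$ are tuned by the lower-level optimization. The control descends to the left child of a node when $f(\hat{\mathbf{x}}) \le 0$. The MountainCar and LunarLander rules use the modulus form, while the CarFollowing rules do not.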
C.4 LunarLander Problem
One of the NLDT_OLs induced using the open-loop supervised training is shown in Figure A.7. The performance of this NLDT_OL was presented in the main paper. It has a depth of 6 and involves a total of 26 rules. The figure also shows how this 26-rule NLDT_OL can be pruned to smaller-sized NLDTs (such as NLDT-5, NLDT-4, NLDT-3, and NLDT-2) starting from the root node. A compilation of results on the closed-loop performance of these trees, before and after re-optimizing them using the closed-loop training, is provided in Table 6 of the main paper. The main paper has also presented a four-rule NLDT*-3 obtained by a closed-loop training of the above NLDT-3.

To demonstrate the efficacy and repeatability of our proposed approach, we perform another run of the open-loop and closed-loop training and obtain a slightly different NLDT*-3, which is shown in Figure A.8. This NLDT also has four rules, which are shown in Table A.2. The four rules of the pruned NLDT(P)_OL (depth 3) are also shown in the table for comparison. It can be noticed that the re-optimization of the NLDT through closed-loop training (Section 5.2 in the main paper) modifies the values of the coefficients and biases; however, the basic structure of all four rules remains intact.

Figure A.9 shows the closed-loop training curve for generating NLDT* from the depth-3 NLDT(P)_OL. The objective is to maximize the closed-loop fitness (reward) $F_{CL}$ (Eq. 7 of the main paper), which is expressed as the average of the cumulative reward $R_e$ collected over $M$ episodes. It is evident that the best-population reward climbs to the maximum possible reward of 200 by the 25th generation, and the average reward of the population also catches up with the best reward value over the generations.

A visualization of the real-time closed-loop performance obtained using this new NLDT (Figure A.8) for two different rule-sets (i.e., before and after applying the re-optimization) is shown in the video at https://youtu.be/DByYWTQ6X3E. It can be observed in the video that the closed-loop control executed using the depth-3 NLDT(P)_OL, comprising rules directly obtained from the open-loop training (i.e., without any re-optimization), brings the LunarLander close to the target nicely, but it hovers above the ground and does not land on most occasions, thereby terminating an episode after the flight time runs out. On the other hand, the depth-3 NLDT*, comprising rule-sets obtained after re-optimization through closed-loop training, is able to successfully approach the landing base and land the LunarLander. A comparison between the oracle DNN and the NLDT* is also provided at the end of the video. The DNN, which has about 5,000 parameters, is able to execute the control task, but in some cases it is not able to land the LunarLander properly. The NLDT*, on the other hand, has only four simple nonlinear rules and executes the control task efficiently.
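For concreteness, a minimal sketch of how the closed-loop fitness $F_{CL}$ can be estimated for a candidate policy is given below. It assumes the (older-API) OpenAI Gym LunarLander-v2 environment; `policy` is a stand-in for the NLDT inference function mapping a state to a discrete action.

import gym
import numpy as np

def closed_loop_fitness(policy, M=20):
    """Estimate F_CL: the average cumulative reward R_e over M episodes."""
    env = gym.make("LunarLander-v2")
    rewards = []
    for _ in range(M):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = policy(state)            # NLDT inference: state -> action
            state, reward, done, _ = env.step(action)
            total += reward
        rewards.append(total)
    env.close()
    return float(np.mean(rewards))

D Computing Infrastructure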
For open-loop training, 10 runs are performed in parallel using Python's multiprocessing module on a 56-core Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz machine. For closed-loop training, only a single run is performed; we distribute the population pool across 50 cores of the above-mentioned machine to evaluate the population members in parallel.
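The following sketch shows one way such a parallel evaluation can be organized with multiprocessing.Pool; `evaluate_member` is a hypothetical stand-in for building an NLDT from a coefficient vector and measuring its closed-loop fitness.

from multiprocessing import Pool

def evaluate_member(weights):
    """Hypothetical worker: build an NLDT from `weights` and return its
    closed-loop fitness. A dummy score is returned here as a placeholder."""
    return -sum(w * w for w in weights)

def evaluate_population(population, n_workers=50):
    """Fan the population out over `n_workers` processes."""
    with Pool(processes=n_workers) as pool:
        return pool.map(evaluate_member, population)

if __name__ == "__main__":
    pop = [[0.1, -0.2, 0.7], [0.5, 0.4, -0.9]]
    print(evaluate_population(pop, n_workers=2))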
E Parameter Settings
E.1 Open-loop Training
For NLDT open-loop training, we used the default parameter setting prescribed in (Dhebar and Deb 2020), except for the population size of the upper-level GA, which in our case is set to 10 for all problems. The lower-level optimization was done using the real-coded genetic algorithm (RGA) implementation from the Python package pymoo: Multi-objective Optimization in Python (Blank and Deb 2020).
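The sketch below shows how a lower-level weight optimization of this kind can be set up in pymoo (assuming the pymoo >= 0.6 API). The problem class and its objective are illustrative stand-ins: the actual lower-level objective in the NLDT method measures the quality of the split induced by a node rule, which is not reproduced here.

import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.soo.nonconvex.ga import GA
from pymoo.optimize import minimize

class SplitRuleProblem(ElementwiseProblem):
    """Illustrative lower-level problem: tune the weights and bias of a
    single node rule f(x) = sum_i w_i * B_i(x) + theta."""
    def __init__(self, n_terms=3):
        super().__init__(n_var=n_terms + 1, n_obj=1, xl=-1.0, xu=1.0)

    def _evaluate(self, w, out, *args, **kwargs):
        # Placeholder objective; the real one scores the class purity of
        # the resulting split on the labelled state-action data.
        out["F"] = float(np.sum(w ** 2))

res = minimize(SplitRuleProblem(), GA(pop_size=50), ("n_gen", 30), verbose=False)
print(res.X, res.F)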
E.2 Closed-loop Training
We use the RGA implementation from pymoo (Blank and Deb 2020) to do the closed-loop training. The parameter settings we used are listed below; a pymoo sketch of this configuration is given after the figure captions that follow.
• Population size: different for NLDTs of different depths; see Table A.3 for details.
• Initialization: random, seeded with one population member carrying the coefficient and bias values of the parent NLDT_OL.
• Crossover: simulated binary crossover (Deb and Agrawal 1995), with $\eta_c = 3$ and $p_c = 0.\cdots$.
• Mutation: polynomial mutation (Deb, Sindhya, and Okabe 2007), with $\eta_m = 5$ and $p_m = 1/n_{vars}$.
• Selection: binary tournament selection (Goldberg and Deb 1991).

Figure A.7: NLDT-6 (with 26 rules) and other lower-depth NLDTs for the LunarLander problem. Lower-depth NLDTs are extracted from the depth-6 NLDT. Each node has an associated node-id (on top) and a node-class (mentioned at the bottom within parentheses). Table 6 in the main paper provides results on the closed-loop performance obtained using these trees before and after applying re-optimization on the rule-sets using the closed-loop training procedure.

Figure A.8: Topology of the depth-3 NLDT(P)_OL obtained from a different run on the LunarLander problem. The equations corresponding to the conditional nodes before and after re-optimization are provided in Table A.2.
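A sketch of this GA configuration in pymoo (>= 0.6) follows. The number of decision variables and the crossover probability are illustrative; `parent_weights` stands for the coefficient/bias vector extracted from the parent NLDT_OL.

import numpy as np
from pymoo.algorithms.soo.nonconvex.ga import GA
from pymoo.operators.crossover.sbx import SBX
from pymoo.operators.mutation.pm import PM

n_vars = 12                          # illustrative: weights/biases of the tree
parent_weights = np.zeros(n_vars)    # placeholder for the NLDT_OL rule vector
pop_size = 50                        # depth <= 3 entry of Table A.3

# Seeded initialization: one copy of the parent rules, the rest random.
X0 = np.random.uniform(-1.0, 1.0, size=(pop_size, n_vars))
X0[0] = parent_weights

algorithm = GA(
    pop_size=pop_size,
    sampling=X0,                             # biased initial population
    crossover=SBX(eta=3, prob=0.9),          # eta_c = 3; prob illustrative
    mutation=PM(eta=5, prob=1.0 / n_vars),   # eta_m = 5, p_m = 1/n_vars
)
# GA's default mating selection in pymoo is binary tournament selection.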
E.3 Black-box RL Algorithms
CartPole and LunarLander Problems:
For the CartPole and LunarLander problems, we use an implementation of the proximal policy optimization (PPO) algorithm (Schulman et al. 2017) from https://github.com/nikhilbarhate99/PPO-PyTorch with its default parameter setting, other than the maximum number of episodes, which in our case is set to 2000.
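For reference, PPO trains the policy $\pi_\theta$ by maximizing the clipped surrogate objective of Schulman et al. (2017):

\[
L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t \right) \right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},
\]

where $\hat{A}_t$ is an estimate of the advantage function at time step $t$.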
Figure A.9: Closed-loop training plot for fine-tuning the rule-set corresponding to the depth-3 NLDT(P)_OL (Table A.2) to obtain NLDT* for the LunarLander problem.

MountainCar Problem:
We use an implementation of the SARSA algorithm (Rummery and Niranjan 1994) based on tile encoding (Sutton 1996) from https://github.com/amohamed11/OpenAIGym-Solutions with its default parameter setting.
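For reference, a minimal sketch of the one-step SARSA update with linear function approximation over tile features follows; `features(s, a)` is a hypothetical helper returning the indices of the active tiles, and the step sizes are illustrative.

import numpy as np

def sarsa_update(w, features, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """One-step SARSA: w <- w + alpha * td_error * grad_w q(s, a), where
    q(s, a) is the sum of the weights of the active tiles for (s, a)."""
    active = features(s, a)                      # hypothetical tile indices
    q = w[active].sum()
    q_next = w[features(s_next, a_next)].sum()
    td_error = r + gamma * q_next - q
    w[active] += alpha * td_error
    return w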
CarFollowing Problem:
We implemented the double deep Q-learning algorithm (Van Hasselt, Guez, and Silver 2015) using PyTorch. The following parameter setting was used; a sketch of the double-Q target computation is given at the end of this subsection.
• Maximum episodes = 400
• Batch size = 32
• Learning rate = 0.01
• $\epsilon$ (for the $\epsilon$-greedy policy) = 0.9
• Discount factor ($\gamma$) = 0.9
• Target-net update frequency = 100
• Replay memory capacity = 2000
• Number of hidden layers = 2, with ReLU activation functions
• Number of hidden nodes per hidden layer = 50

Table A.2: NLDT rules before and after the closed-loop training for the LunarLander problem, for which NLDT* is shown in Figure A.8. The video at https://youtu.be/DByYWTQ6X3E shows the simulation output of the performance of the NLDTs with the rule-sets mentioned in this table. The respective minimum and maximum state variables are x_min = [-0.38, -0.08, -0.80, -0.88, -0.42, -0.85, 0.00, 0.00] and x_max = [0.46, 1.52, 0.80, 0.50, 0.43, 0.95, 1.00, 1.00]. Each row of the table lists the nonlinear rule at one conditional node, first as obtained from the open-loop training (depth-3 NLDT(P)_OL) and then after re-optimization (depth-3 NLDT*); the basic structure of each rule is identical in the two versions, with only the coefficient and bias values differing.

Table A.3: Population size for closed-loop training.

Depth | Population size
≤ 3 | 50
4 | 75
5 | 100
6 | 150
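A minimal PyTorch sketch of the double-Q target, the ingredient that distinguishes double DQN from vanilla DQN, is given below; `q_online` and `q_target` stand for the two networks described above.

import torch

def double_q_target(q_online, q_target, reward, next_state, done, gamma=0.9):
    """Double DQN target: the online network selects the next action and the
    target network evaluates it,
        y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    with torch.no_grad():
        best_action = q_online(next_state).argmax(dim=1, keepdim=True)
        next_q = q_target(next_state).gather(1, best_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q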
F Summary of this Document