Maximum Likelihood Constraint Inference from Stochastic Demonstrations
David L. McPherson, Kaylene C. Stocking, S. Shankar Sastry

February 26, 2021
Abstract
When an expert operates a perilous dynamic system, ideal constraint information is tacitly contained in their demonstrated trajectories and controls. The likelihood of these demonstrations can be computed, given the system dynamics and task objective, and the maximum likelihood constraints can be identified. Prior constraint inference work has focused mainly on deterministic models. Stochastic models, however, can capture the uncertainty and risk tolerance that are often present in real systems of interest. This paper extends maximum likelihood constraint inference to stochastic applications by using maximum causal entropy likelihoods. Furthermore, we propose an efficient algorithm that computes constraint likelihood and risk tolerance in a unified Bellman backup, allowing us to generalize to stochastic systems without increasing computational complexity.
1 Introduction

Optimization-based control (notably, model predictive control) promises intelligent behavior [4] even in challenging nonlinear dynamics [14] and even stochastic dynamics [21][5]. It has already shaped industrial practice for decades through model predictive control [16], and its recent incarnation as "deep reinforcement learning" [14][19] promises to revolutionize control again. Yet these optimizations must always begin with a fundamental question: what should the automation optimize? Answering this question is notoriously difficult [2] and lies at the core of the so-called "Value Alignment" problem of AI safety.

One solution approach is to solve the inverse of optimal control: given near-optimal demonstrations, recover the metric the demonstrator is optimizing [9]. After fitting the task specification, the objective can be optimized to imitate the expert behavior [1] or used to predict human motion [24].

Often, inverse optimal control focuses on inferring the objective metric that is being continuously optimized. However, in parallel to constrained optimal control for safety, algorithms are being extended to infer hard constraints [3][8][11][15][18]. Chou et al. [7] inferred constraints along paths that would be low cost but were never observed. This intuition was formalized by Scobee et al. [18] by translating maximum entropy inverse reinforcement learning [23] to work for hard constraints. Unfortunately, the maximum entropy formulation used in [18] only works for deterministic systems.

Non-deterministic dynamics reflect the uncertainty that exacerbates safety risks: the agent must go beyond merely avoiding bad controls to actively combating diffusion into unsafety. Stochasticity can model a variety of unpredictable dynamics in applications: from unpredictable power sources in renewable power systems [10] to hard-to-model turbulence in road conditions [21], from tumor cell growth in cancer treatment [17] to unforeseen changes in stormwater reservoirs [6].

The maximum entropy likelihoods can be extended to uncertain transition dynamics by conditioning the entropy at each time step only on the previously revealed state transitions [22]. This maximum causal entropy has been applied to learn general non-Markovian specifications [20]. A subclass of non-Markovian specifications are state-action constraints that proscribe that a configuration or control should never be taken. This paper specializes the inference to just this subclass, paralleling the approach of [18] but applying maximum causal entropy to stochastic dynamics.

This work advances prior art [18] in inferring state-action constraints:

• by respecting causality, using the principle of maximum causal entropy for likelihood generative models;

• by extending the hypothesis family to include risk-tolerating chance constraints;

• and by streamlining the algorithm into one backwards pass, thereby maintaining the same computational complexity as the non-stochastic version [18].

2 Problem Formulation

The inference task is to fit parameters to observed demonstration trajectories. We follow the maximum likelihood inference framework that models data as samples of a parametrized distribution. In our case, the distribution can be derived from the task description and agent model; these will be described in subsections 2.b and 2.d, respectively. The parameters to fit are which candidate constraints $C \in \mathcal{C}$ are affecting the expert and at what risk tolerance $\alpha_C$.
These chance constraints will be described in subsection 2.c.

2.b Markov Decision Processes

In general, a stochastic process is any indexed collection of random variables. For discrete-time processes, these variables are indexed by integers $t \in [0:T]$. A discrete-time Markov process $X_t \in \mathcal{X}$, $t \in [0:T]$, is one with the discrete-time Markov property:

$$ P(X_t \mid X_{t-1}, X_{t-2}, \ldots, X_0) = P(X_t \mid X_{t-1}), \quad \forall t > 0 \tag{1} $$

A Markov decision process further allows the transition distribution to be steered by a chosen action $a_t \in \mathcal{A}$. This conditional transition distribution can therefore (assuming that the state space $\mathcal{X}$ is a measurable space) be captured by a probability density function parametrized by the previous state $x_{t-1}$ and chosen action $a_{t-1}$:

$$ P_{a_{t-1}}(X_t = x_t \mid X_{t-1} = x_{t-1}) = S(x_{t-1}, a_{t-1}, x_t) \tag{2} $$

These actions are then evaluated by a metric that maps the state and action sequences into some scalar reward or penalty, $R(x_{[0:T]}, a_{[0:T-1]})$:

$$ R(x_{[0:T]}, a_{[0:T-1]}) = \sum_{t=0}^{T-1} r(x_t, a_t) + w(x_T) \tag{3} $$

where $r$ is the running cost and $w$ is the final cost.

Bundling these four objects:

• the state space $\mathcal{X}$,

• the set of actions $\mathcal{A}$,

• the transition distribution function $P_{a_{t-1}}(x_t \mid x_{t-1})$,

• and the objective metric $R(x_{[0:T]}, a_{[0:T-1]})$

makes up the 4-tuple that defines the Markov Decision Process.

This process's distribution on reward outcomes induced by certain action choices becomes the foundation of stochastic control problems in discrete spaces. This paper models the expert's performance as operating in this discrete space under additional constraints. We describe these additional constraints next.

2.c Chance Constraints

The agent must also choose its actions to avoid dangerous states $x \in C_X$. To model some risk tolerance, we allow some small probability $\psi(x)$ of transitioning to an $x \in C_X$:

$$ P(X_{t+1} = x \mid X_t = x_t, a_t) \le \psi(x), \quad \forall x \in C_X \tag{4} $$

To deterministically constrain out a state $x$, set $\psi(x) = 0$. On the other hand, setting $\psi(x) = 1$ means the constraint is inactive and transitioning to $x$ is freely allowed. Therefore the set of state constraints $C_X$ can be encoded as a function $\psi(x)$ over all states $x \in \mathcal{X}$.

There can also be constraints that rule out illegal actions $a \in C_A$. Since the stochastic dynamics only make states uncertain, we only need chance constraints on the states and not on the actions.

Let $C$ be the tuple $(\psi(x), C_A)$, and call the set of all these constraint candidates $\mathcal{C}$, so $C \in \mathcal{C}$.

Let $W_C^t(x_t)$ be the set of actions $a_t$ that satisfy $C_A$ from $x_t$ and generate state transitions that satisfy Eq. (4). And let $\Phi_C^t(a, x)$ be its indicator: the indicator of whether Eq. (4) and $a \notin C_A$ are satisfied:

$$ \Phi_C^t(a, x) = \mathbb{I}\left[ a \notin C_A \,\cap\, \left( P(X_{t+1} = \hat{x} \mid X_t = x, a) \le \psi(\hat{x}) \;\; \forall \hat{x} \in C_X \right) \right] \tag{5} $$
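To make Eq. (5) concrete, the following is a minimal Python sketch (the paper itself gives no code) of the indicator $\Phi_C^t(a, x)$ and the feasible action set $W_C^t(x)$ on a tabular MDP. The names `S`, `psi`, `C_A`, and `C_X` are hypothetical stand-ins for the transition density, chance thresholds, action constraints, and constrained states defined above.

```python
import numpy as np

def phi(S, psi, C_A, C_X, x, a):
    """Phi_C^t(a, x) of Eq. (5): 1.0 iff action a is legal (a not in C_A)
    and every constrained state xh in C_X is entered with probability
    at most psi[xh] when taking a from x."""
    if a in C_A:
        return 0.0
    return float(all(S[x, a, xh] <= psi[xh] for xh in C_X))

def feasible_actions(S, psi, C_A, C_X, x):
    """W_C^t(x): the actions from x whose indicator Phi is 1."""
    n_actions = S.shape[1]
    return [a for a in range(n_actions) if phi(S, psi, C_A, C_X, x, a)]
```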
2.d Maximum Causal Entropy

All the probabilities in the previous section were based purely on process noise that makes $X_t$ random given a deterministic choice of controls $a_t$ for $t \in [0:T]$. The chance constraints outlaw certain induced distributions over $X_t$ given $a_{t-1}$. Inside these chance constraints, the agent chooses actions. There are many possible ways an agent might choose its actions. We assume it is endeavouring to optimize the reward function $\sum_t r(X_t, a_t) + w(X_T)$ defined in Eq. (3), and we assume nothing else. The Maximum Entropy method optimizes the generative model distribution $P_C(a \mid x)$ to fit this known constraint set $C$ and leave all other facets maximally agnostic [22]. This distribution layers on another level of stochasticity, purely at the level of selecting controller sequences $A_t$, $\forall t \in [0:T]$. Ziebart [22, p. 74] finds the maximal causal entropy distribution to be compactly defined by a backwards iteration with close analogues to the Bellman backup, only with the max exchanged for a differentiable approximation, the softmax:

$$ P_C(a_t \mid x_t) = \frac{e^{Q^{soft}_{C,t}(a_t, x_t)}}{e^{V^{soft}_{C,t}(x_t)}} \, \Phi_C^t(a_t, x_t) \tag{6} $$

$$ Q^{soft}_{C,t}(a_t, x_t) = r(x_t, a_t) + \mathbb{E}_{X_{t+1}} V^{soft}_{C,t+1}(x_{t+1}) \tag{7} $$

$$ V^{soft}_{C,t}(x_t) = \log \sum_{a_t \in W_C^t} e^{Q^{soft}_{C,t}(a_t, x_t)} \tag{8} $$

$$ \qquad\quad\;\; = \operatorname{softmax}_{a_t \in W_C^t} Q^{soft}_{C,t}(a_t, x_t) \tag{9} $$

where $Q^{soft}$ can be interpreted as a state-action soft-optimal value-to-go and $V^{soft}$ as the state's soft-optimal value-to-go. Note that these parallels are coincidental, as these recursions actually derive from tracking normalizing constants on causally normalized distributions.

These "soft Bellman" distributions in Equation (6) over single-timestep actions $P(a_t \mid x_t)$ can form a joint distribution over horizon-wide sequences of controllers:

$$ P_C(A_{[t:T]} = a_{[t:T]} \mid X_t = x_t) \tag{10} $$

$$ = \begin{cases} \dfrac{e^{\mathbb{E}[R(X_{[t:T]}, a_{[t:T]})]}}{e^{V^{soft}_{C,t}(x_t)}}, & \text{if } a_{[t:T]} \in W^{[t:T]}_C \\[4pt] 0, & \text{if } a_{[t:T]} \notin W^{[t:T]}_C \end{cases} \tag{11} $$

This full-horizon formulation of the action distribution is useful for relating this generative behavior model to the empirical expert demonstrations. Let the cross product of the $W^t_{C^+}(\cdot)$ over all time points be $W^{[0:T]}_{C^+}$: the set of controller sequences $a^T_{(\cdot)}$ that satisfy $C_A$ for all possible input states and generate state transitions that satisfy Eq. (4) for all time. The demonstrations must lie within $W^{[0:T]}_{C^+}$. So when inferring constraints, we can instantly rule out any constraint set $C^+$ that places the demonstrations outside of its $W^{[0:T]}_{C^+}$. Amongst the remaining constraint-set candidates $C^+ \in \mathcal{C}$, we choose the one that maximizes the likelihood of observing the demonstrations. Indeed, across all demonstrations, each likelihood rescales by a factor $F_{C^+,t}(x_t)$ unique to each candidate constraint:

$$ P_{C^+}(A_{[t:T]} = a_{[t:T]} \mid X_t = x_t) \tag{12} $$

$$ = \frac{e^{\mathbb{E}[R(X_{[t:T]}, a_{[t:T]})]}}{e^{V^{soft}_{C^+,t}(x_t)}} \tag{13} $$

$$ = \frac{e^{\mathbb{E}[R(X_{[t:T]}, a_{[t:T]})]}}{e^{V^{soft}_{C,t}(x_t)}} \cdot \frac{e^{V^{soft}_{C,t}(x_t)}}{e^{V^{soft}_{C^+,t}(x_t)}} \tag{14} $$

$$ = P_C(A_{[t:T]} = a_{[t:T]} \mid X_t = x_t) \, \frac{1}{F_{C^+,t}(x_t)} \tag{15} $$

where we've set $F_{C^+,t}(x_t)$ to:

$$ F_{C^+,t}(x_t) = \frac{e^{V^{soft}_{C^+,t}(x_t)}}{e^{V^{soft}_{C,t}(x_t)}} \tag{16} $$

Therefore the likelihoods can be computed for just one constrained optimal control problem, call it $C$, and then readily translated into the likelihoods for all constraints under consideration. Prior art [18] bounded the sub-optimality of inferring the constraints incrementally by adding one state to $C_X$ or one action to $C_A$ per step. Amongst all these candidates $C^+$, whichever has the lowest $F_{C^+,0}(x_0)$ (assessed at the start $x_0$) will have the highest likelihood for all the demonstrations and be the maximum likelihood constraint.
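As a hedged sketch, the backup in Eqs. (6)-(9) can be written on the same tabular representation as before (`feasible_actions` is the illustrative helper from subsection 2.c; none of these names come from the paper's own code):

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_backup(S, r, w, psi, C_A, C_X, T):
    """Backward recursion for Eqs. (6)-(9). Returns soft values V[t, x]
    and per-timestep action distributions P[t, x, a].
    Assumes every state keeps at least one feasible action, so the
    soft values stay finite."""
    nX, nA = r.shape
    V = np.zeros((T + 1, nX))
    V[T] = w                                        # terminal soft value
    P = np.zeros((T, nX, nA))
    for t in range(T - 1, -1, -1):
        for x in range(nX):
            Q = np.full(nA, -np.inf)                # infeasible actions stay -inf
            for a in feasible_actions(S, psi, C_A, C_X, x):
                Q[a] = r[x, a] + S[x, a] @ V[t + 1]  # Eq. (7)
            V[t, x] = logsumexp(Q)                   # Eqs. (8)-(9): soft max
            P[t, x] = np.exp(Q - V[t, x])            # Eq. (6)
    return V, P
```

Running this twice, once for $C$ and once for $C^+$, would yield $F_{C^+,t}$ via Eq. (16) directly; Theorem 4.1 below shows how to avoid the second full solve.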
Property. Consider two candidate constraints $C^+$ and $C^{+-}$ that differ only by $C^{+-}$ having exactly one $\psi(x)$ lower than $C^+$ has. Then $C^{+-}$ will always have $F_{C^{+-},0}(x_0) \le F_{C^+,0}(x_0)$.
Corollary 3.0.1. When considering adding a single state $x$ into $C_X$, always choose the lowest possible $\psi(x)$ that doesn't rule out any demonstrations.

The ratio defined in Equation (16) can be computed by modifying the soft Bellman backup defined in Equations (6)-(8). This modified backup procedure is described in the following theorem:
Theorem 4.1.
Let $C$ be a set of constraints and $C^+$ be an augmented version of $C$ so that $W_{C^+} \subset W_C$. Then any $F_{C^+,t}(x_t)$ can be computed with the same sums as for the base $C$:

$$ F_{C^+,t}(x_t) = \mathbb{E}_{a_t \sim P_C}\left[ \Phi^t_{C^+}(a_t, x_t) \, e^{\mathbb{E}_{x_{t+1}} \log\left( F_{C^+,t+1}(x_{t+1}) \right)} \right] $$

Proof.

$$ F_{C^+,t}(x_t) = \frac{e^{V^{soft}_{C^+,t}(x_t)}}{e^{V^{soft}_{C,t}(x_t)}} = \frac{\sum_{a_t \in W^t_{C^+}} e^{Q^{soft}_{C^+,t}(a_t, x_t)}}{e^{V^{soft}_{C,t}(x_t)}} = \sum_{a_t \in W^t_{C^+}} \frac{e^{r(x_t, a_t) + \mathbb{E}_{x_{t+1}} V^{soft}_{C^+,t+1}(x_{t+1})}}{e^{V^{soft}_{C,t}(x_t)}} $$

It will be convenient to define the logarithm of our $F_{C,t}$. Let it be $\Delta^t_C$:

$$ \Delta^{t+1}_{C^+}(x_{t+1}) = \log\left( F_{C^+,t+1}(x_{t+1}) \right) = \log \frac{e^{V^{soft}_{C^+,t+1}(x_{t+1})}}{e^{V^{soft}_{C,t+1}(x_{t+1})}} = V^{soft}_{C^+,t+1}(x_{t+1}) - V^{soft}_{C,t+1}(x_{t+1}) $$

Then the ratio can be redefined in terms of previously calculated terms on $C$ and our iterating $F_{C,t}$:

$$ F_{C^+,t}(x_t) = \sum_{a_t \in W^t_{C^+}} \frac{e^{r(x_t, a_t) + \mathbb{E}_{x_{t+1}} V^{soft}_{C,t+1}(x_{t+1})}}{e^{V^{soft}_{C,t}(x_t)}} \cdot e^{\mathbb{E}_{x_{t+1}} \Delta^{t+1}_{C^+}(x_{t+1})} $$

$$ = \sum_{a_t \in W^t_{C^+}} \frac{e^{Q^{soft}_{C,t}(a_t, x_t)}}{e^{V^{soft}_{C,t}(x_t)}} \, e^{\mathbb{E}_{x_{t+1}} \Delta^{t+1}_{C^+}(x_{t+1})} $$

$$ = \sum_{a_t \in W^t_{C^+}} P_C(a_t \mid x_t) \, e^{\mathbb{E}_{x_{t+1}} \Delta^{t+1}_{C^+}(x_{t+1})} $$

$$ = \mathbb{E}_{a_t \sim P_C} \, \Phi^t_{C^+}(a_t, x_t) \, e^{\mathbb{E}_{x_{t+1}} \Delta^{t+1}_{C^+}(x_{t+1})} \tag{17} $$
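Theorem 4.1's recursion translates directly into a second backward pass that reuses the base policy $P_C$; the following is a minimal sketch under the same tabular assumptions as the earlier snippets (and assuming $F_{C^+,t+1}$ stays strictly positive wherever it is reached, so the logarithm is finite):

```python
def ratio_backup(S, P, psi_plus, C_A_plus, C_X_plus, T):
    """F_{C+,t}(x) per Theorem 4.1, reusing the base-model policy
    P[t, x, a] from soft_bellman_backup rather than re-solving C+.
    psi_plus / C_A_plus / C_X_plus describe the augmented candidate C+."""
    nX, nA = P.shape[1], P.shape[2]
    F = np.ones((T + 1, nX))                 # F_{C+,T} = 1 at the horizon
    for t in range(T - 1, -1, -1):
        for x in range(nX):
            total = 0.0
            for a in range(nA):
                if phi(S, psi_plus, C_A_plus, C_X_plus, x, a):
                    # e^{E_{x_{t+1}} log F_{C+,t+1}(x_{t+1})}
                    total += P[t, x, a] * np.exp(S[x, a] @ np.log(F[t + 1]))
            F[t, x] = total
    return F
```

On a toy problem one can check that `F[0, x0]` matches `np.exp(V_plus[0, x0] - V[0, x0])` obtained from two separate `soft_bellman_backup` calls, which is exactly Eq. (16).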
Algorithm

Theorem 4.1 implies that an algorithm can compute the conversion ratios $F_{C^+}(x)$ (which correspond to how much the distribution shrinks) for all candidate constraints at the same time as the Bellman backup for the baseline set of constraints $C$. The Greedy Iterative Constraint Inference procedure pioneered in [18] suggests this selection can be performed iteratively, adding just one constraint at a time. This iterative approach can be shown to be boundedly sub-optimal compared to selecting all the constraints simultaneously [18]. In this iterative approach, the $F_{C^+}$-optimizing $C^+$ becomes the baseline set of constraints for the next iteration, $C_i$.

Corollary 3.0.1 states that this set of candidates can be further reduced to only those whose newly added $\psi(x)$ are as exclusive as possible without excluding any of the demonstrations $(\tilde{x}^T, \tilde{a}^T) \in \mathcal{D}$. That is, when adding state constraints, the newly added exclusion threshold $\psi(x)$ must be as low as possible while still being greater than all transition probabilities to $x$ that were chosen by the expert in their demonstrations. For simplicity, we lower-bound $\psi(x)$ to prevent any precursor state of $x$ from having all its available actions ruled out, which would doom any trajectory entering that precursor state to necessarily violate the chance constraint on $x$. Therefore this lower bound $\Psi(x)$ must be defined:

$$ \Psi(x') = \max_{\{x \mid \exists \hat{a} : P(x' \mid x, \hat{a}) > 0\}} \; \min_a P(x' \mid x, a) \tag{18} $$

This implies that the new $\psi(x)$ should be:

$$ \psi(x) = \max\left\{ \max_{(\tilde{x}^T, \tilde{a}^T) \in \mathcal{D}} \; \max_{t \in [0:T-1]} S(\tilde{x}(t), \tilde{a}(t), x), \;\; \Psi(x) \right\} \tag{19} $$
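The threshold selection of Eqs. (18)-(19) amounts to a pair of max/min sweeps over the transition tensor; a sketch, with demonstrations assumed to be lists of (state, action) pairs (an illustrative interface, not the paper's):

```python
def choose_psi(S, demos, x_c):
    """Lowest admissible psi(x_c) per Eqs. (18)-(19): above every
    demonstrated transition probability into x_c, and above the floor
    Psi(x_c) so that no precursor state of x_c is left with no action."""
    nX = S.shape[0]
    # Eq. (18): over states that can reach x_c at all, take the
    # best-case (min over actions) probability of entering x_c
    Psi = max((S[x, :, x_c].min() for x in range(nX)
               if S[x, :, x_c].max() > 0), default=0.0)
    # Eq. (19): the largest transition probability into x_c that any
    # demonstration actually relied upon
    demo_floor = max((S[x, a, x_c] for traj in demos for (x, a) in traj),
                     default=0.0)
    return max(demo_floor, Psi)
```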
The most likely constraint is then whichever one still allows the observed demonstrations and has the smallest normalizing constant from the starting state, $F_{C,0}(x_0)$.

Note that Algorithm 1 has computations on the order of $O(|\mathcal{X}|(|\mathcal{X}| + |\mathcal{A}|))$, identical to the computational complexity of prior art in maximum likelihood constraint inference [18].

Algorithm 1: Modified Bellman Backup with Value Ratio
Result: $V_{C,t}$ and a column vector $F$, where each entry corresponds to the $F_{C^+,0}$ of a candidate that adds one state/action constraint on top of $C$, $C^+ \in \mathcal{C}^{+\prime}$ (the $F$ and $D$ updates below run in parallel for each candidate)

    for x ∈ X do
        Z(T, x) ← exp(w(x))
        F(T, x) ← 1
    end
    for t ∈ [T−1 : 0] do
        for x ∈ X do
            Z(t, x) ← 0
            F(t, x) ← 0
            for a ∈ A do
                Q(t, x, a) ← r(x, a)
                D(t, x, a) ← 0
                for x′ ∈ X do
                    Q(t, x, a) += S(x, a, x′) · log(Z(t+1, x′))
                    D(t, x, a) += S(x, a, x′) · log(F(t+1, x′))
                end
                Z(t, x) += Φ_C(x, a) · exp(Q(t, x, a))
                F(t, x) += Φ_{C+}(x, a) · exp(Q(t, x, a)) · exp(D(t, x, a))
            end
            F(t, x) ← F(t, x) / Z(t, x)
        end
    end
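The paper's implementation was in MATLAB; below is a hedged Python transcription of Algorithm 1, fusing `soft_bellman_backup` and `ratio_backup` into one backward pass over partition values $Z = e^{V}$ exactly as in the pseudocode. The interface (`base` and `candidates` as $(\psi, C_A, C_X)$ triples) is an assumption for illustration, and numerical safeguards (working fully in log space, handling states with no feasible actions) are omitted for clarity.

```python
def algorithm1(S, r, w, base, candidates, T):
    """One backward pass: Z[t, x] = e^{V_{C,t}(x)} for the base constraint
    set, and F[c][t, x] = F_{C+,t}(x) for each candidate augmentation c.
    `base` and each candidates[c] are (psi, C_A, C_X) triples; assumes Z
    and F stay strictly positive wherever reached."""
    nX, nA = r.shape
    Z = np.zeros((T + 1, nX))
    Z[T] = np.exp(w)
    F = {c: np.ones((T + 1, nX)) for c in candidates}
    for t in range(T - 1, -1, -1):
        for x in range(nX):
            # Q(t, x, a) = r(x, a) + sum_x' S(x, a, x') log Z(t+1, x')
            q = np.array([r[x, a] + S[x, a] @ np.log(Z[t + 1])
                          for a in range(nA)])
            feas = np.array([phi(S, *base, x, a) for a in range(nA)])
            Z[t, x] = feas @ np.exp(q)
            for c, cons in candidates.items():
                d = np.array([S[x, a] @ np.log(F[c][t + 1])
                              for a in range(nA)])
                feas_c = np.array([phi(S, *cons, x, a) for a in range(nA)])
                F[c][t, x] = (feas_c @ np.exp(q + d)) / Z[t, x]
    return Z, F
```

The maximum likelihood single addition is then the candidate minimizing `F[c][0, x0]` at the start state, matching the selection rule above.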
Algorithm 1 was implemented in MATLAB and tested on a synthetic dataset of 100 demonstrations. This dataset was synthesized from simulated trajectories of a stochastically optimal agent minimizing distance traveled on a two-dimensional "Gridworld" MDP with movement in all eight compass directions. These eight directions made up the action space $\mathcal{A}$, along with a loitering terminal action for once the goal was reached. Each directional action was given a fixed "slippage" chance of 0.25 for all states.

[Figure 1 not reproduced; panel titles: (A) "True MDP with demonstrations' ψ(x)", (B1) "Choosing constraint 1", (B2-4) "Accumulating constraints", (B5-7) "Diminishing returns"; color scales show log F_{C+/C}(x) and chance levels ψ(x).]

Figure 1 (caption): The constraint inference algorithm was evaluated on a gridworld synthetic dataset with stochastic dynamics. In the panels, X marks represent constraints. (A) The demonstrator's true set of constraints to be inferred. Cell shading indicates whether that state can be considered as a constraint given the demonstration set: state cell shading represents the $\psi(x)$ allowed by Equation (19), and action cell shading indicates the binary indicator of whether that action was sampled in any of the demonstrations. (B) The sequence of inferred constraints alongside the value of adding other candidate constraints. Cell shading corresponds to how much further partition mass would be eliminated by introducing a constraint on that action or a chance constraint on that state (taking the log of the ratio for distinguishability). (B1) The value of choosing the first constraint over nothing. (B2-4) Adding the first three scales up the demonstration likelihood drastically. (B5-7) After the fourth added constraint, the continued scaling drops off. The fifth constraint happens to be a true constraint, but the sixth fails to identify the constraints outside the demonstrated area, instead mis-selecting a constraint right on its boundary.

The simulated demonstrator only noisily optimized the task, following a Boltzmann choice distribution as described in Equation (11). The constraint inference algorithm was evaluated on this dataset as shown in Figure 1. By the fifth iteration (shown in Figure 1, B5), the algorithm succeeded in recovering the ground-truth constraints (shown in Figure 1, A).
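For reference, here is a hypothetical reconstruction of the gridworld dynamics described above. The paper does not specify how the slip probability is distributed, so this sketch spreads it uniformly over the eight compass moves; treat it as an assumption, not the paper's benchmark.

```python
def gridworld_transitions(n, slip=0.25):
    """S[x, a, x'] for an n-by-n grid: eight compass moves plus a loiter
    action; each move succeeds w.p. 1 - slip and otherwise slips to a
    uniformly random compass neighbor (edges clamp to the grid)."""
    moves = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]
    nX, nA = n * n, len(moves) + 1
    S = np.zeros((nX, nA, nX))
    def step(i, j, di, dj):                  # clamp to stay on the grid
        return min(max(i + di, 0), n - 1), min(max(j + dj, 0), n - 1)
    for x in range(nX):
        i, j = divmod(x, n)
        for a, (di, dj) in enumerate(moves):
            ii, jj = step(i, j, di, dj)
            S[x, a, ii * n + jj] += 1.0 - slip    # intended move
            for (si, sj) in moves:                # slip mass
                ii, jj = step(i, j, si, sj)
                S[x, a, ii * n + jj] += slip / len(moves)
        S[x, len(moves), x] = 1.0                 # loitering terminal action
    return S
```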
The algorithms set forth in this paper focused on discretized state and action spaces. For controlling many systems on practical timescales, the state must be handled as a continuous parameter. Future work should investigate how gridded state spaces like the one in Figure 1 could be refined to approximate continuous state spaces. Reducing the algorithm to a variant Bellman backup, as we did in Theorem 4.1, suggests that the continuous variant may amount to solving a Hamilton-Jacobi-Bellman equation. These partial differential equations have a rich literature investigating their solution, including toolsets like [13].

Extending constraint inference to stochastic systems raises the question of whether human experts might be better modeled using a prospect-theoretic or risk-sensitive measure, as in [12]. Future work should investigate how human heuristics for statistical prediction might impact the way demonstrations are generated. The algorithm should be designed to be robust to these biases, or even to leverage their structure.
By designing the likelihoods to maximize the causal entropy (which respects the information flow of state-transition outcome revelation), this work makes maximum likelihood estimation possible for stochastic dynamics that reflect the uncertainties inherent in perilous situations. Moreover, by broadening the hypothesis class to include chance constraints, our algorithm learns not only the constraints from expert operators but also their risk tolerances. This opens the door to studying how expert operators plan risk-sensitively and what prospect-theoretic risk measures they may be employing.

Although it increases the complexity of systems that can be handled in constraint inference, this algorithm maintains the same computational complexity of $O(|\mathcal{X}|(|\mathcal{X}| + |\mathcal{A}|))$ as prior art. That is, control engineers can extract safety specifications from expert demonstration data for the same cost in both stochastic and deterministic dynamics.

References

[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In
Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.

[2] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[3] Leopoldo Armesto, Jorren Bosga, Vladimir Ivan, and Sethu Vijayakumar. Efficient learning of constraints and generic null space policies. In , pages 1520-1526. IEEE, 2017.

[4] Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216-1226, 2013.

[5] Bart van den Broek, Wim Wiegerinck, and Hilbert Kappen. Risk sensitive path integral control. arXiv preprint arXiv:1203.3523, 2012.

[6] Margaret P Chapman, Jonathan Lacotte, Aviv Tamar, Donggun Lee, Kevin M Smith, Victoria Cheng, Jaime F Fisac, Susmit Jha, Marco Pavone, and Claire J Tomlin. A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems. In , pages 2958-2963. IEEE, 2019.

[7] Glen Chou, Dmitry Berenson, and Necmiye Ozay. Learning constraints from demonstrations. In International Workshop on the Algorithmic Foundations of Robotics, pages 228-245. Springer, 2018.

[8] Glen Chou, Necmiye Ozay, and Dmitry Berenson. Learning parametric constraints in high dimensions from demonstrations. In Conference on Robot Learning, pages 1211-1230. PMLR, 2020.

[9] Rudolf Emil Kalman. When is a linear control system optimal? Journal of Basic Engineering, 86(1):51-60, 1964.

[10] Mohammad E Khodayar, Mohammad Shahidehpour, and Lei Wu. Enhancing the dispatchability of variable wind generation by coordination with pumped-storage hydro units in stochastic power systems. IEEE Transactions on Power Systems, 28(3):2808-2818, 2013.

[11] Changshuo Li and Dmitry Berenson. Learning object orientation constraints and guiding constraints for narrow passages from one demonstration. In International Symposium on Experimental Robotics, pages 197-210. Springer, 2016.

[12] Eric Mazumdar, Lillian J Ratliff, Tanner Fiez, and S Shankar Sastry. Gradient-based inverse risk-sensitive reinforcement learning. In , pages 5796-5801. IEEE, 2017.

[13] Ian M Mitchell. A toolbox of level set methods. UBC Department of Computer Science Technical Report TR-2007-11, 2007.

[14] Andrew Y Ng, H Jin Kim, Michael I Jordan, Shankar Sastry, and Shiv Ballianda. Autonomous helicopter flight via reinforcement learning. In NIPS, volume 16. Citeseer, 2003.

[15] Claudia Pérez-D'Arpino and Julie A Shah. C-LEARN: Learning geometric constraints from demonstrations for multi-step manipulation in shared autonomy. In , pages 4058-4065. IEEE, 2017.

[16] S Joe Qin and Thomas A Badgwell. A survey of industrial model predictive control technology. Control Engineering Practice, 11(7):733-764, 2003.

[17] Tyler Risom, Ellen M Langer, Margaret P Chapman, Juha Rantala, Andrew J Fields, Christopher Boniface, Mariano J Alvarez, Nicholas D Kendsersky, Carl R Pelz, Katherine Johnson-Camacho, et al. Differentiation-state plasticity is a targetable resistance mechanism in basal-like breast cancer. Nature Communications, 9(1):1-17, 2018.

[18] Dexter RR Scobee and S Shankar Sastry. Maximum likelihood constraint inference for inverse reinforcement learning. arXiv preprint arXiv:1909.05477, 2019.

[19] Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, Pieter Abbeel, Wolfram Burgard, Michael Milford, et al. The limits and potentials of deep learning for robotics. The International Journal of Robotics Research, 37(4-5):405-420, 2018.

[20] Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K Ho, and Sanjit Seshia. Learning task specifications from demonstrations. In Advances in Neural Information Processing Systems, pages 5367-5377, 2018.

[21] Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Aggressive driving with model predictive path integral control. In , pages 1433-1440. IEEE, 2016.

[22] Brian D Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.

[23] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433-1438. Chicago, IL, USA, 2008.

[24] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Human behavior modeling with maximum entropy inverse optimal control. In