Exploiting Submodular Value Functions For Scaling Up Active Perception
Yash Satsangi, Shimon Whiteson, Frans A. Oliehoek, Matthijs T. J. Spaan
Autonomous Robots: Special Issue on Active Perception
Abstract In active perception tasks, an agent aims to select sensory actions that reduce its uncertainty about one or more hidden variables. For example, a mobile robot takes sensory actions to efficiently navigate in a new environment. While partially observable Markov decision processes (POMDPs) provide a natural model for such problems, reward functions that directly penalize uncertainty in the agent's belief can remove the piecewise-linear and convex (PWLC) property of the value function required by most POMDP planners. Furthermore, as the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially with it, making POMDP planning infeasible with traditional methods. In this article, we address the twofold challenge of modeling and planning for active perception tasks. We analyze ρPOMDP and POMDP-IR, two frameworks for modeling active perception tasks that restore the PWLC property of the value function. We show the mathematical equivalence of these two frameworks by showing that, given a ρPOMDP along with a policy, they can be reduced to a POMDP-IR and an equivalent policy (and vice-versa). We prove that the value function for the given ρPOMDP (and the given policy) and the reduced POMDP-IR (and the reduced policy) is the same. To plan efficiently for active perception tasks, we identify and exploit the independence properties of POMDP-IR to reduce the computational cost of solving POMDP-IR (and ρPOMDP). We propose greedy point-based value iteration (PBVI), a new POMDP planning method that uses greedy maximization to greatly improve scalability in the action space of an active perception POMDP. Furthermore, we show that, under certain conditions, including submodularity, the value function computed using greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. We establish the conditions under which the value function of an active perception POMDP is guaranteed to be submodular. Finally, we present a detailed empirical analysis on a dataset collected from a multi-camera tracking system employed in a shopping mall. Our method achieves similar performance to existing methods but at a fraction of the computational cost, leading to better scalability for solving active perception tasks.

Yash Satsangi
Informatics Institute, University of Amsterdam
E-mail: [email protected]
Tel.: +31-(0)20-525-8516
Fax: +31-(0)20-525-7490

Shimon Whiteson
University of Oxford

Frans A. Oliehoek
University of Liverpool
University of Amsterdam

Matthijs T. J. Spaan
Delft University of Technology
Keywords
Sensor selection · Long-term planning · Mobile sensors · Submodularity · POMDP
1 Introduction

Multi-sensor systems are becoming increasingly prevalent in a wide range of settings. For example, multi-camera systems are now routinely used for security, surveillance and tracking (Kreucher et al, 2005; Natarajan et al, 2012; Spaan et al, 2015). A key challenge in the design of these systems is the efficient allocation of scarce resources such as the bandwidth required to communicate the collected data to a central server, the
CPU cycles required to process that data, and the energy costs of the entire system (Kreucher et al, 2005; Williams et al, 2007; Spaan and Lima, 2009). For example, state-of-the-art human activity recognition algorithms require high-resolution video streams coupled with significant computational resources. When a human operator must monitor many camera streams, displaying only a small number of them can reduce the operator's cognitive load. IP cameras connected directly to a local area network need to share bandwidth. Such constraints give rise to the dynamic sensor selection problem (Satsangi et al, 2015), in which an agent must, at each time step, select K out of the N available sensors to allocate these resources to, where K is the maximum number of sensors allowed given the resource constraints.

This article extends the research presented by Satsangi et al (2015) at AAAI 2015. In this article, we present additional theoretical results on the equivalence of POMDP-IR and ρPOMDP, a new technique that exploits the independence properties of POMDP-IR to solve it more efficiently, and a detailed empirical analysis of belief-based rewards for POMDPs in active perception tasks. This is a corrected version of the article and contains the same corrections as pointed out in Satsangi et al (2015). We thank Csaba Szepesvári for pointing this out. The original article is made publicly available by Springer Journals at https://doi.org/10.1007/s10514-017-9666-5.

For example, consider the surveillance task, in which a mobile robot aims to minimize its future uncertainty about the state of the environment but can use only K of its N sensors at each time step. Surveillance is an example of an active perception (Bajcsy, 1988) task, in which an agent takes actions to reduce uncertainty about one or more hidden variables while reasoning about various resource constraints. When the state of the environment is static, a myopic approach that always selects actions maximizing the immediate expected reduction in uncertainty is typically sufficient. However, when the state changes over time, a non-myopic approach that reasons about the long-term effects of the action selection performed at each time step can be better. For example, in the surveillance task, as the robot moves and the state of the environment changes, it becomes essential to reason about the long-term consequences of the robot's actions to minimize future uncertainty.

A natural decision-theoretic model for such an approach is the partially observable Markov decision process (POMDP) (Sondik, 1971; Kaelbling et al, 1998; Kochenderfer, 2015). POMDPs provide a comprehensive and powerful framework for planning under uncertainty. They can model the dynamic and partially observable state and express the goals of the system in terms of rewards associated with state-action pairs.
This model of the world can be used to compute closed-loop, long term policies that can help the agent to de-cide what actions to take given a belief about the stateof the environment (Burgard et al, 1997; Kurniawatiet al, 2010).In a typical POMDP, reducing uncertainty aboutthe state is only a means to an end . For example, arobot whose goal is to reach a particular location maytake sensing actions that reduce its uncertainty aboutits current location because doing so helps it determinewhat future actions will bring it closer to its goal. Bycontrast, in active perception problems reducing uncer-tainty is an end in itself . For example, in the surveil-lance task, the system’s goal is typically to ascertainthe state of its environment, not use that knowledge toachieve a goal. While perception is arguably always per-formed to aid decision-making, in an active perceptionproblem that decision is made by another agent such asa human, that is not modeled as a part of the POMDP.For example, in the surveillance task, the robot mightbe able to detect a suspicious activity but only the hu-man users of the system may decide how to react tosuch an activity.One way to formulate uncertainty reduction as anend in itself is to define a reward function whose ad-ditive inverse is some measure of the agent’s uncer-tainty about the hidden state, e.g., the entropy of its belief . However this formulation leads to a reward func-tion that conditions on the belief, rather than the stateand the resulting value function is not PWLC, whichmakes many traditional POMDP solvers inapplicable.There exists online planning methods (Silver and Ve-ness, 2010; Bonet and Geffner, 2009), which generatespolicy on the fly, that do not require the PWLC prop-erty of the value function. However, many of thesemethods require multiple ‘hypothetical’ belief updatesto compute the optimal policy, which makes them un-suitable for sensor selection where the optimal policymust be computed in a fraction of a second. There ex-ists other online planning methods that do not requirehypothetical belief updates (Silver and Veness, 2010),but since we are dealing with belief based rewards, theycannot be directly applied here. Here, we address thecase of offline planning where the policy is computedbefore execution of the task.Thus, to efficiently solve active perception problems,we must (a) model the problem with minimizing uncer-tainty as the objective while maintaining a PWLC valuefunction and (b) use this model to solve the POMDP ef-ficiently. Recently, two frameworks have been proposed, ρ POMDP (Araya-L´opez et al, 2010) and
POMDP withInformation Reward (POMDP-IR) (Spaan et al, 2015)to efficiently model active perception tasks, such that xploiting Submodular Value Functions for Scaling Up Active Perception 3 the PWLC property of the value function is maintained.The idea behind ρ POMDP is to find a PWLC approx-imation to the “true” continuous belief-based rewardfunction, and then solve it with the traditional solvers.POMDP-IR, on the other hand, allows the agent tomake predictions about the hidden state and the agentis rewarded for accurate predictions via a state-basedreward function. There is no research that examines therelationship between these two frameworks, their prosand cons, or their efficacy in realistic tasks, thus it isnot clear how to choose between these two frameworksto model the active perception problems.In this article, we address the problem of efficientmodeling and planning for active perception tasks.First, we study the relationship between ρ POMDP andPOMDP-IR. Specifically, we establish equivalence be-tween them by showing that any ρ POMDP can be re-duced to a POMDP-IR (and vice-versa) that preservesthe value function for equivalent policies. Having estab-lished the theoretical relationship between ρ POMDPand POMDP-IR, we model the surveillance task as aPOMDP-IR and propose a new method to solve it ef-ficiently by exploiting a simple insight that lets us de-compose the maximization over prediction actions andnormal actions while computing the value function.Although POMDPs are computationally difficultto solve, recent methods (Littman, 1996; Hauskrecht,2000; Pineau et al, 2006; Spaan and Vlassis, 2005;Poupart, 2005; Ji et al, 2007; Kurniawati et al, 2008;Shani et al, 2012) have proved successful in solvingPOMDPs with large state spaces. Solving active per-ception POMDPs pose a different challenge: as thenumber of sensors grows, the size of the action space (cid:0) NK (cid:1) grows exponentially with it. Current POMDPsolvers fail to address scalability in the action spaceof a POMDP. We propose a new point-based planningmethod that scales much better in the number of sen-sors for such POMDPs. The main idea is to replacethe maximization operator in the Bellman optimalityequation with greedy maximization in which a subsetof sensors is constructed iteratively by adding the sen-sor that gives the largest marginal increase in value.We present theoretical results bounding the errorin the value functions computed by this method. Weprove that, under certain conditions including submod-ularity , the value function computed using POMDPbackups based on greedy maximization has boundederror. We achieve this by extending the existing results(Nemhauser et al, 1978) for the greedy algorithm, whichare valid only for a single time step, to a full sequen-tial decision making setting where the greedy operatoris employed multiple times over multiple time steps. Inaddition, we show that the conditions required for such a guarantee to hold are met, or approximately met, ifthe reward is defined using negative belief entropy.Finally, we present a detailed empirical analysis ona real-life dataset from a multi-camera tracking systeminstalled in a shopping mall. We identify and study thecritical factors relevant to the performance and behav-ior of the agent in active perception tasks. 
We show thatour proposed planner outperforms a myopic baselineand nearly matches the performance of existing point-based methods while incurring only a fraction of thecomputational cost, leading to much better scalabilityin the number of cameras. Sensor selection as an active perception task has beenstudied in many contexts. Most work focus on eitheropen-loop or myopic solutions, e.g., (Kreucher et al,2005), (Spaan and Lima, 2009), (Williams et al, 2007),(Joshi and Boyd, 2009). Kreucher et al (2005) pro-poses a Monte-Carlo approach that mainly focuses ona myopic solution. Williams et al (2007) and Joshi andBoyd (2009) developed planning methods that can pro-vide long-term but open-loop policies. By contrast, aPOMDP-based approach enables a closed-loop, non-myopic approach can lead to a better performancewhen the underlying state of the world changes overtime. Spaan (2008), Spaan and Lima (2009), Spaanet al (2010) and Natarajan et al (2012) also considera POMDP-based approach to active and cooperativeactive perception. However, they consider an objectivefunction that conditions on state and not on belief,as the belief-dependent rewards in POMDP break thePWLC property of the value function. They use point-based methods (Spaan and Vlassis, 2005) for solving thePOMDPs. While recent point-based methods (Shaniet al, 2012) for solving POMDPs scale reasonably in thestate space of POMDPs, they do not address the scala-bility in the action and observation space of a POMDP.Greedy PBVI focuses specially on the scalability inthe action space of an active perception POMDP andprovides better scalability by leveraging greedy max-imization. Traditionally, POMDPs require the rewardfunction to be defined as a function of the state. How-ever, for active perception POMDPs, the objective is toreduce the uncertainty in the belief of the agent.In recent years, applying greedy maximization tosubmodular functions has become a popular and ef-fective approach to sensor placement/selection (Krauseand Guestrin, 2005, 2007; Kumar and Zilberstein,2009). However, such work focuses on myopic or fullyobservable settings and thus does not enable the long-
Satsangi et al. term planning required to cope with dynamic state ina POMDP.
Adaptive submodularity (Golovin and Krause, 2011)is a recently developed extension that addresses theselimitations by allowing action selection to conditionon previous observations. However, it assumes a staticstate and thus cannot model the dynamics of a POMDPacross timesteps. Therefore, in a POMDP, adaptivesubmodularity is only applicable within a timestep, dur-ing which state does not change but the agent can se-quentially add sensors to a set. In principle, adaptivesubmodularity could enable this intra-timestep sequen-tial process to be adaptive, i.e., the choice of later sen-sors could condition on the observations generated byearlier sensors. However, this is not possible in our set-ting because (a) we assume that, due to computationalcosts, all sensors must be selected simultaneously; (b)information gain is not known to be adaptive submod-ular (Chen et al, 2015). Consequently, our analysis con-siders only classic, non-adaptive submodularity.To our knowledge, our work is the first to es-tablish the sufficient conditions for the submodular-ity of POMDP value functions for active perceptionPOMDPs and thus leverage greedy maximization toscalably compute bounded approximate policies for dy-namic sensor selection modeled as a full POMDP.
3 Background

In this section, we provide background on POMDPs, active perception POMDPs, and solution methods for POMDPs.

3.1 Partially Observable Markov Decision Processes

POMDPs provide a decision-theoretic framework for modeling partial observability and dynamic environments. Formally, a POMDP is defined by a tuple $\langle S, A, \Omega, T, O, R, b_0, h \rangle$. At each time step, the environment is in a state $s \in S$, the agent takes an action $a \in A$ and receives a reward whose expected value is $R(s,a)$, and the system transitions to a new state $s' \in S$ according to the transition function $T(s,a,s') = \Pr(s'|s,a)$. Then, the agent receives an observation $z \in \Omega$ according to the observation function $O(s',a,z) = \Pr(z|s',a)$. Starting from an initial belief $b_0$, the agent maintains a belief $b(s)$ about the state, which is a probability distribution over all possible states. The number of time steps for which the decision process lasts, i.e., the horizon, is denoted by $h$. If the agent takes action $a$ in belief $b$ and receives observation $z$, the updated belief $b^{a,z}(s)$ can be computed using Bayes' rule.

Fig. 1 Illustration of the PWLC property of the value function. The value function is the upper surface indicated by the solid lines.

A policy $\pi$ specifies how the agent acts in each belief. Given $b(s)$ and $R(s,a)$, one can compute a belief-based reward $\rho(b,a)$ as:
$$\rho(b,a) = \sum_s b(s) R(s,a). \quad (1)$$
The $t$-step value function of a policy, $V^\pi_t$, is defined as the expected future discounted reward the agent can gather by following $\pi$ for the next $t$ steps. $V^\pi_t$ can be characterized recursively using the Bellman equation:
$$V^\pi_t(b) \triangleq \Big[\rho(b,a_\pi) + \sum_{z \in \Omega} \Pr(z|a_\pi,b)\, V^\pi_{t-1}(b^{a_\pi,z})\Big], \quad (2)$$
where $a_\pi = \pi(b)$ and $V^\pi_0(b) = 0$. The action-value function $Q^\pi_t(b,a)$ is the value of taking action $a$ and following $\pi$ thereafter:
$$Q^\pi_t(b,a) \triangleq \rho(b,a) + \sum_{z \in \Omega} \Pr(z|a,b)\, V^\pi_{t-1}(b^{a,z}). \quad (3)$$
The policy that maximizes $V^\pi_t$ is called the optimal policy $\pi^*$ and the corresponding value function is called the optimal value function $V^*_t$. The optimal value function $V^*_t(b)$ can be characterized recursively as:
$$V^*_t(b) = \max_a \Big[\rho(b,a) + \sum_{z \in \Omega} \Pr(z|a,b)\, V^*_{t-1}(b^{a,z})\Big]. \quad (4)$$
We can also define the Bellman optimality operator $B^*$:
$$(B^* V_{t-1})(b) = \max_a \Big[\rho(b,a) + \sum_{z \in \Omega} \Pr(z|a,b)\, V_{t-1}(b^{a,z})\Big],$$
and write (4) as $V^*_t(b) = (B^* V^*_{t-1})(b)$.

An important consequence of these equations is that the value function is piecewise-linear and convex (PWLC), as shown in Figure 1, a property exploited by most POMDP planners. Sondik (1971) showed that a PWLC value function at any finite time step $t$ can be expressed as a set of vectors: $\Gamma_t = \{\alpha_0, \alpha_1, \dots, \alpha_m\}$. Each $\alpha_i$ represents an $|S|$-dimensional hyperplane defining the value function over a bounded region of the belief space. The value of a given belief point can be computed from the vectors as $V^*_t(b) = \max_{\alpha_i \in \Gamma_t} \sum_s b(s)\alpha_i(s)$.

3.2 POMDP Solvers

Exact methods like Monahan's enumeration algorithm (Monahan, 1982) compute the value function for all possible belief points by computing the optimal $\Gamma_t$. Point-based planners (Pineau et al, 2006; Shani et al, 2012; Spaan and Vlassis, 2005), on the other hand, avoid the expense of solving for all belief points by computing $\Gamma_t$ only for a set of sampled beliefs $B$. Since exact POMDP solvers (Sondik, 1971; Monahan, 1982) are intractable for all but the smallest POMDPs, we focus on point-based methods here. Point-based methods compute $\Gamma_t$ using the following recursive algorithm.

At each iteration (starting from $t = 1$), for each action $a$ and observation $z$, an intermediate $\Gamma^{a,z}_t$ is computed from $\Gamma_{t-1}$:
$$\Gamma^{a,z}_t = \{\alpha^{a,z}_i : \alpha_i \in \Gamma_{t-1}\}. \quad (5)$$
Next, $\Gamma^a_t$ is computed only for the sampled beliefs, i.e., $\Gamma^a_t = \{\alpha^a_b : b \in B\}$, where:
$$\alpha^a_b = \Gamma^a + \sum_{z \in \Omega} \operatorname*{argmax}_{\alpha \in \Gamma^{a,z}_t} \sum_{s'} b(s')\alpha(s'), \quad (6)$$
with $\Gamma^a$ denoting the immediate reward vector for action $a$. Finally, the best $\alpha$-vector for each $b \in B$ is selected:
$$\alpha_b = \operatorname*{argmax}_{\alpha^a_b} \sum_{s'} b(s')\alpha^a_b(s'), \quad (7)$$
$$\Gamma_t = \cup_{b \in B}\, \alpha_b. \quad (8)$$
At each timestep $t$, the above algorithm generates $|A||\Omega||\Gamma_{t-1}|$ alpha vectors in $O(|S|^2|A||\Omega||\Gamma_{t-1}|)$ time and then reduces them to $|B|$ vectors in $O(|S||B||A||\Omega||\Gamma_{t-1}|)$ time (Pineau et al, 2006).
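To make the belief update and the point-based backup of (5)–(8) concrete, the following Python sketch (not code from the original article; the array layouts T[s,a,s'], O[s',a,z], R[s,a] and the absence of discounting, matching (2)–(4) as written, are our own assumptions) shows one backup over a set of sampled beliefs:

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes' rule: b^{a,z}(s') is proportional to O(s',a,z) * sum_s T(s,a,s') b(s)."""
    bp = O[:, a, z] * (T[:, a, :].T @ b)   # unnormalized posterior over s'
    return bp / bp.sum()

def point_based_backup(B, Gamma, T, O, R):
    """One point-based backup (Eqs. 5-8): returns Gamma_t computed only at the sampled beliefs B."""
    n_actions, n_obs = T.shape[1], O.shape[2]
    Gamma_t = []
    for b in B:
        best_val, best_alpha = -np.inf, None
        for a in range(n_actions):
            # Eq. (6): alpha^a_b = Gamma^a + sum_z argmax_{alpha in Gamma^{a,z}_t} b . alpha
            alpha_ab = R[:, a].astype(float).copy()        # immediate-reward vector Gamma^a
            for z in range(n_obs):
                # Eq. (5): back-project each alpha in Gamma_{t-1} through action a, observation z
                projections = [T[:, a, :] @ (O[:, a, z] * alpha) for alpha in Gamma]
                alpha_ab += max(projections, key=lambda g: float(b @ g))
            val = float(b @ alpha_ab)
            if val > best_val:                             # Eq. (7): best vector for this belief
                best_val, best_alpha = val, alpha_ab
        Gamma_t.append(best_alpha)                         # Eq. (8): union over sampled beliefs
    return Gamma_t
```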
4 Active Perception POMDPs

The goal in an active perception POMDP is to reduce uncertainty about a feature of interest that is not directly observable. In general, the feature of interest may be only part of the state, e.g., if a surveillance system cares only about people's positions, not their velocities, or higher-level features derived from the state. However, for simplicity, we focus on the case where the feature of interest is just the state $s$ of the POMDP. For simplicity, we also focus on pure active perception tasks in which the agent's only goal is to reduce uncertainty about the state, as opposed to hybrid tasks where the agent may also have other goals. For such cases, hybrid rewards (Eck and Soh, 2012), which combine the advantages of belief-based and state-based rewards, are appropriate. Although not covered in this article, it is straightforward to extend our results to hybrid tasks (Spaan et al, 2015).

We model the active perception task as a POMDP in which an agent must choose a subset of available sensors at each time step. We assume that all selected sensors must be chosen simultaneously, i.e., it is not possible within a timestep to condition the choice of one sensor on the observations generated by another sensor. This corresponds to the common setting where generating each sensor's observation is time consuming, e.g., in the surveillance task, because it requires applying expensive computer vision algorithms, and thus all the observations from the selected cameras must be generated in parallel. Formally, an active perception POMDP has the following components:

– Actions $a = \langle a_1 \dots a_N \rangle$ are vectors of $N$ binary action features, each of which specifies whether a given sensor is selected or not. For each $a$, we also define its set equivalent $a = \{i : a_i = 1\}$, i.e., the set of indices of the selected sensors. Due to the resource constraints, the set of all actions $A = \{a : |a| \le K\}$ contains only sensor subsets of size $K$ or less. $A^+ = \{1, \dots, N\}$ indicates the set of all sensors.

– Observations $z = \langle z_1 \dots z_N \rangle$ are vectors of $N$ observation features, each of which specifies the sensor reading obtained by the given sensor. If sensor $i$ is not selected, then $z_i = \emptyset$. The set equivalent of $z$ is $z = \{z_i : z_i \ne \emptyset\}$. To prevent ambiguity about which sensor generated which observation in $z$, we assume that, for all $i$ and $j$, the domains of $z_i$ and $z_j$ share only $\emptyset$. This assumption is only made for notational convenience and does not restrict the applicability of our methods in any way.

For example, in the surveillance task, $a$ indicates the set of cameras that are active and $z$ are the observations received from the cameras in $a$. The model for the sensor selection problem in the surveillance task is shown in Figure 2. Here, we assume that the actions involve only selecting $K$ out of $N$ sensors. (We make this assumption without loss of generality; the following sections will make it clear that none of our results require it.) The transition function is thus independent of the actions, as selecting sensors cannot change the state. However, as we outline in a later subsection (7.4), it is possible to extend our results to general active perception POMDPs with arbitrary transition functions, which can model, e.g., mobile sensors that, by moving, change the state.

Fig. 2 Model for the sensor selection problem.

A challenge in these settings is properly formalizing the reward function. Because the goal is to reduce uncertainty, the reward is a direct function of the belief, not the state, i.e., the agent has no preference for one state over another, so long as it knows what that state is. Hence, there is no meaningful way to define a state-based reward function $R(s,a)$. Directly defining $\rho(b,a)$ using, e.g., the negative belief entropy $-H_b(s) = \sum_s b(s)\log(b(s))$ results in a value function that is not piecewise-linear. Since $\rho(b,a)$ is no longer a convex combination of a state-based reward function, it is no longer guaranteed to be PWLC, a property most POMDP solvers rely on. In the following subsections, we describe two recently proposed frameworks designed to address this problem.
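As a concrete illustration of the action representation above (a small sketch under our own naming assumptions, not code from the article), the set of sensor-selection actions with at most K active sensors and their binary-vector forms can be enumerated as follows:

```python
from itertools import combinations

def sensor_actions(N, K):
    """All actions a subset of A+ = {0,...,N-1} with |a| <= K, as frozensets (set equivalents)."""
    A_plus = range(N)
    return [frozenset(c) for k in range(K + 1) for c in combinations(A_plus, k)]

def as_binary_vector(a, N):
    """Vector form <a_1 ... a_N>: a_i = 1 iff sensor i is selected."""
    return [1 if i in a else 0 for i in range(N)]

# e.g., N = 10 cameras, at most K = 2 selected: C(10,0) + C(10,1) + C(10,2) = 56 actions
print(len(sensor_actions(10, 2)))  # 56
```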
4.1 ρPOMDPs

A ρPOMDP (Araya-López et al, 2010), defined by a tuple $\langle S, A, T, \Omega, O, \Gamma_\rho, b_0, h \rangle$, is a normal POMDP except that the state-based reward function $R(s,a)$ has been omitted and $\Gamma_\rho$ has been added. $\Gamma_\rho$ is a set of vectors that defines the immediate reward for the ρPOMDP. Since we consider only pure active perception tasks, $\rho$ depends only on $b$, not on $a$, and can be written as $\rho(b)$. Given $\Gamma_\rho$, $\rho(b)$ can be computed as $\rho(b) = \max_{\alpha \in \Gamma_\rho} \sum_s b(s)\alpha(s)$. If the true reward function is not PWLC, e.g., the negative belief entropy, it can be approximated by defining $\Gamma_\rho$ as a set of vectors, each of which is tangent to the true reward function. Figure 3 illustrates approximating the negative belief entropy with different numbers of tangents.

Fig. 3 Defining $\Gamma_\rho$ with different sets of tangents to the negative belief entropy curve in a 2-state POMDP.

Solving a ρPOMDP requires a minor change to existing algorithms. In particular, since $\Gamma_\rho$ is a set of vectors, instead of a single vector, an additional cross-sum is required to compute $\Gamma^a_t$: $\Gamma^a_t = \Gamma_\rho \oplus \Gamma^{a,z_1}_t \oplus \Gamma^{a,z_2}_t \oplus \dots$. Araya-López et al (2010) showed that the error in the value function computed by this approach, relative to the true reward function whose tangents were used to define $\Gamma_\rho$, is bounded. However, the additional cross-sum increases the computational complexity of computing $\Gamma^a_t$ to $O(|S||A||\Gamma_{t-1}||\Omega||B||\Gamma_\rho|)$ with point-based methods.

Though ρPOMDPs do not put any constraints on the definition of $\rho$, we restrict the definition of $\rho$ for an active perception POMDP to a set of vectors, ensuring that $\rho$ is PWLC, which in turn ensures that the value function is PWLC. This is not a severe restriction because solving a ρPOMDP using offline planning requires a PWLC approximation of $\rho$ anyway.

Arguably, there is a counter-intuitive relation between the general class of POMDPs and the sub-class of pure active perception problems: on the one hand, the class of POMDPs is a more general set of problems, and it is intuitive to assume that there might be harder problems in the class. On the other hand, many POMDP problems admit a representation of the value function using a finite set of vectors. In contrast, the use of entropy would require an infinite number of vectors to merely represent the reward function. Therefore, even though we consider a specific sub-class of POMDPs, this class has properties that make it difficult to address using existing methods.
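The tangent construction above has a simple closed form on the belief simplex: the tangent to $\sum_s b(s)\log b(s)$ at a belief point $\beta$ is the vector $\alpha_\beta(s) = \log\beta(s)$, and the resulting $\rho(b)$ lower-bounds the true curve (Gibbs' inequality). The following Python sketch (our own illustration, not code from the article; natural logarithms and the grid of tangent points are assumptions) builds such a $\Gamma_\rho$:

```python
import numpy as np

def tangent_vector(beta, eps=1e-12):
    """Tangent hyperplane to f(b) = sum_s b(s) log b(s) at the belief point beta.
    On the simplex it reduces to alpha(s) = log beta(s); by Gibbs' inequality
    sum_s b(s) log beta(s) <= f(b), with equality at b = beta."""
    return np.log(np.clip(beta, eps, 1.0))

def make_gamma_rho(belief_points):
    """Gamma_rho: one tangent vector per sampled belief point."""
    return [tangent_vector(beta) for beta in belief_points]

def rho(b, gamma_rho):
    """rho(b) = max_{alpha in Gamma_rho} sum_s b(s) alpha(s): a PWLC lower bound on f."""
    return max(float(np.dot(b, alpha)) for alpha in gamma_rho)

# Two-state example with tangents drawn at a uniform grid of belief points.
grid = [np.array([p, 1.0 - p]) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
G = make_gamma_rho(grid)
b = np.array([0.8, 0.2])
print(rho(b, G), float(np.sum(b * np.log(b))))   # PWLC approximation <= true negative entropy
```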
4.2 POMDPs with Information Rewards

Spaan et al (2015) proposed POMDPs with information rewards (POMDP-IR), an alternative framework for modeling active perception tasks that relies only on the standard POMDP. Instead of directly rewarding low uncertainty in the belief, the agent is given the chance to make predictions about the hidden state and is rewarded, via a standard state-based reward function, for making accurate predictions. Formally, a POMDP-IR is a POMDP in which each action $a \in A$ is a tuple $\langle a_n, a_p \rangle$, where $a_n \in A_n$ is a normal action, e.g., moving a robot or turning on a camera (in our case $a_n$ is $a$), and $a_p \in A_p$ is a prediction action, which expresses predictions about the state. The joint action space is thus the Cartesian product of $A_n$ and $A_p$, i.e., $A = A_n \times A_p$.

Prediction actions have no effect on states or observations but can trigger rewards via the standard state-based reward function $R(s, \langle a_n, a_p \rangle)$. While there are many ways to define $A_p$ and $R$, a simple approach is to create one prediction action for each state, i.e., $A_p = S$, and give the agent positive reward if and only if it correctly predicts the true state:
$$R(s, \langle a_n, a_p \rangle) = \begin{cases} 1, & \text{if } s = a_p \\ 0, & \text{otherwise.} \end{cases} \quad (9)$$
Thus, POMDP-IR indirectly rewards beliefs with low uncertainty, since these enable more accurate predictions and thus more expected reward.

Furthermore, since a state-based reward function is explicitly defined, $\rho$ can be defined as a convex combination of $R$, as in (1), guaranteeing a PWLC value function, as in a regular POMDP. Thus, a POMDP-IR can be solved with standard POMDP planners. However, the introduction of prediction actions leads to a blowup in the size of the joint action space $|A| = |A_n||A_p|$ of POMDP-IR. Replacing $|A|$ with $|A_n||A_p|$ in the analysis yields a complexity of computing $\Gamma^a_t$ for POMDP-IR of $O(|S||A_n||\Gamma_{t-1}||\Omega||B||A_p|)$ for point-based methods.

Fig. 4 Influence diagram for POMDP-IR.
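To illustrate the prediction-action mechanism (a sketch under our own naming assumptions, not code from the article): with one prediction action per state and the 0/1 reward of (9), the expected immediate reward of predicting $a_p$ in belief $b$ is simply $b(a_p)$, so the maximizing prediction is the most likely state.

```python
import numpy as np

def prediction_reward(s, a_p):
    """Eq. (9): reward 1 iff the predicted state equals the true state."""
    return 1.0 if s == a_p else 0.0

def best_prediction(b):
    """Expected reward of predicting a_p is sum_s b(s) * R(s, a_p) = b(a_p),
    so the maximizing prediction action is the most likely state."""
    expected = [sum(b[s] * prediction_reward(s, a_p) for s in range(len(b)))
                for a_p in range(len(b))]
    return int(np.argmax(expected)), max(expected)

b = np.array([0.1, 0.6, 0.3])
print(best_prediction(b))   # (1, 0.6): predict state 1, expected reward 0.6
```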
Note that, though not made explicit in Spaan et al (2015), several independence properties are inherent to the POMDP-IR framework, as shown in Figure 4. Specifically, the two important properties are: (a) in our setting, the reward function is independent of the normal actions; (b) the transition and observation functions are independent of the prediction actions. Although POMDP-IR can model hybrid rewards, in which normal actions can reward the agent in addition to prediction actions (Spaan et al, 2015), in this article, because we focus on pure active perception, the reward function $R$ is independent of the normal actions. Furthermore, state transitions and observations are independent of the prediction actions. In Section 6, we introduce a new technique showing that these independence properties can be exploited to solve a POMDP-IR much more efficiently and thus avoid the blowup in the size of the action space caused by the introduction of prediction actions. Although the reward function in our setting is independent of the normal actions, the main results we present in this article do not depend on this property and can easily be extended or applied to cases where the reward depends on the normal actions.

5 ρPOMDP and POMDP-IR Equivalence

ρPOMDP and POMDP-IR offer two perspectives on modeling active perception tasks. ρPOMDP starts from a "true" belief-based reward function, such as the negative entropy, and then seeks a PWLC approximation via a set of tangents to the curve. By contrast, POMDP-IR starts from the queries that the user of the system will pose, e.g., "What is the position of everyone in the room?" or "How many people are in the room?", and creates prediction actions that reward the agent for correctly answering such queries. In this section we establish the relationship between these two frameworks by proving the equivalence of ρPOMDP and POMDP-IR. By equivalence of ρPOMDP and POMDP-IR, we mean that, given a ρPOMDP and a policy, we can construct a corresponding POMDP-IR and a policy such that the value functions of both policies are exactly the same. We show this equivalence by starting with a ρPOMDP and a policy and introducing a reduction procedure for both the ρPOMDP and the policy (and vice-versa). Using the reduction procedure, we reduce the ρPOMDP to a POMDP-IR and the policy for the ρPOMDP to an equivalent policy for the POMDP-IR. We then show that the value function $V^\pi_t$ is the same for the ρPOMDP we started with and the reduced POMDP-IR, under the given and the reduced policy. To complete our proof, we repeat the same process by starting with a POMDP-IR and reducing it to a ρPOMDP. We show that the value function $V^\pi_t$ for the POMDP-IR and the corresponding ρPOMDP is the same.
Definition 1
Given a ρPOMDP $M_\rho = \langle S, A_\rho, \Omega, T_\rho, O_\rho, \Gamma_\rho, b_0, h \rangle$, the reduce-pomdp-ρ-IR($M_\rho$) procedure produces a POMDP-IR $M_{IR} = \langle S, A_{IR}, \Omega, T_{IR}, O_{IR}, R_{IR}, b_0, h \rangle$ as follows.

– The set of states, set of observations, initial belief, and horizon remain unchanged. Since the set of states remains unchanged, the set of all possible beliefs is also the same for $M_{IR}$ and $M_\rho$.
– The set of normal actions in $M_{IR}$ is equal to the set of actions in $M_\rho$, i.e., $A_{n,IR} = A_\rho$.
– The set of prediction actions $A_{p,IR}$ in $M_{IR}$ contains one prediction action for each $\alpha^{a_p}_\rho \in \Gamma_\rho$.
– The transition and observation functions in $M_{IR}$ behave the same as in $M_\rho$ for each $a_n$ and ignore $a_p$, i.e., for all $a_n \in A_{n,IR}$: $T_{IR}(s, a_n, s') = T_\rho(s, a, s')$ and $O_{IR}(s', a_n, z) = O_\rho(s', a, z)$, where $a \in A_\rho$ corresponds to $a_n$.
– The reward function in $M_{IR}$ is defined such that, for all $a_p \in A_p$, $R_{IR}(s, a_p) = \alpha^{a_p}_\rho(s)$, where $\alpha^{a_p}_\rho$ is the $\alpha$-vector corresponding to $a_p$.

For example, consider a ρPOMDP with 2 states in which $\rho$ is defined using two tangents to the negative belief entropy, as illustrated in Figure 3 (top). When reduced to a POMDP-IR, the resulting reward function gives a small negative reward for correct predictions and a larger negative reward for incorrect predictions, with the magnitudes determined by the values of the tangents at $b(s) = 0$ and $b(s) = 1$:
$$R_{IR}(s, a_p) = \begin{cases} c_{\text{correct}}, & \text{if } s = a_p \\ c_{\text{incorrect}}, & \text{otherwise,} \end{cases} \quad (10)$$
where $c_{\text{incorrect}} < c_{\text{correct}} < 0$ are the values of the tangents at the corners of the belief simplex.

Definition 2 Given a policy $\pi_\rho$ for a ρPOMDP $M_\rho$, the reduce-policy-ρ-IR($\pi_\rho$) procedure produces a policy $\pi_{IR}$ for a POMDP-IR as follows. For all $b$,
$$\pi_{IR}(b) = \Big\langle \pi_\rho(b),\ \operatorname*{argmax}_{a_p} \sum_s b(s) R(s, a_p) \Big\rangle. \quad (11)$$
That is, $\pi_{IR}$ selects the same normal action as $\pi_\rho$ and the prediction action that maximizes the expected immediate reward.

Using these definitions, we prove that solving $M_\rho$ is the same as solving $M_{IR}$.

Theorem 1 Let $M_\rho$ be a ρPOMDP and $\pi_\rho$ an arbitrary policy for $M_\rho$. Furthermore, let $M_{IR}$ = reduce-pomdp-ρ-IR($M_\rho$) and $\pi_{IR}$ = reduce-policy-ρ-IR($\pi_\rho$). Then, for all $b$,
$$V^{IR}_t(b) = V^\rho_t(b), \quad (12)$$
where $V^{IR}_t$ is the $t$-step value function for $\pi_{IR}$ and $V^\rho_t$ is the $t$-step value function for $\pi_\rho$.

Proof See Appendix. □

Definition 3 Given a POMDP-IR $M_{IR} = \langle S, A_{IR}, \Omega, T_{IR}, O_{IR}, R_{IR}, b_0, h \rangle$, the reduce-pomdp-IR-ρ($M_{IR}$) procedure produces a ρPOMDP $M_\rho = \langle S, A_\rho, \Omega, T_\rho, O_\rho, \Gamma_\rho, b_0, h \rangle$ as follows.

– The set of states, set of observations, initial belief, and horizon remain unchanged. Since the set of states remains unchanged, the set of all possible beliefs is also the same for $M_{IR}$ and $M_\rho$.
– The set of actions in $M_\rho$ is equal to the set of normal actions in $M_{IR}$, i.e., $A_\rho = A_{n,IR}$.
– The transition and observation functions in $M_\rho$ behave the same as in $M_{IR}$ for each $a_n$ and ignore $a_p$, i.e., for all $a \in A_\rho$: $T_\rho(s, a, s') = T_{IR}(s, a_n, s')$ and $O_\rho(s', a, z) = O_{IR}(s', a_n, z)$, where $a_n \in A_{n,IR}$ is the action corresponding to $a \in A_\rho$.
– The set $\Gamma_\rho$ in $M_\rho$ is defined such that, for each prediction action in $A_{p,IR}$, there is a corresponding $\alpha$-vector in $\Gamma_\rho$, i.e., $\Gamma_\rho = \{\alpha^{a_p}_\rho(s) : \alpha^{a_p}_\rho(s) = R(s, a_p) \text{ for each } a_p \in A_{p,IR}\}$. Consequently, by definition, $\rho$ is given by $\rho(b) = \max_{\alpha^{a_p}_\rho} \big[\sum_s b(s)\alpha^{a_p}_\rho(s)\big]$.

Definition 4 Given a policy $\pi_{IR} = \langle a_n, a_p \rangle$ for a POMDP-IR $M_{IR}$, the reduce-policy-IR-ρ($\pi_{IR}$) procedure produces a policy $\pi_\rho$ for a ρPOMDP as follows. For all $b$,
$$\pi_\rho(b) = \pi^n_{IR}(b), \quad (13)$$
where $\pi^n_{IR}(b)$ denotes the normal action selected by $\pi_{IR}$ in belief $b$.

Theorem 2 Let $M_{IR}$ be a POMDP-IR and $\pi_{IR} = \langle a_n, a_p \rangle$ a policy for $M_{IR}$, such that $a_p = \operatorname*{argmax}_{a'_p} \sum_s b(s) R(s, a'_p)$. Furthermore, let $M_\rho$ = reduce-pomdp-IR-ρ($M_{IR}$) and $\pi_\rho$ = reduce-policy-IR-ρ($\pi_{IR}$). Then, for all $b$,
$$V^\rho_t(b) = V^{IR}_t(b), \quad (14)$$
where $V^{IR}_t$ is the value of following $\pi_{IR}$ in $M_{IR}$ and $V^\rho_t$ is the value of following $\pi_\rho$ in $M_\rho$.

Proof See Appendix. □
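The reward-related part of the reduction in Definitions 1 and 2 can be sketched in a few lines of Python (our own illustration, not code from the article; only the reward and policy components are shown, since states, observations, T, and O carry over unchanged):

```python
import numpy as np

def reduce_rho_to_ir(gamma_rho):
    """Definition 1, reward part: one prediction action per alpha-vector in Gamma_rho,
    with R_IR(s, a_p) = alpha^{a_p}_rho(s).  Returned as an |A_p| x |S| array."""
    return np.vstack(gamma_rho)

def reduce_policy_rho_to_ir(pi_rho, R_IR):
    """Definition 2 / Eq. (11): keep the normal action chosen by pi_rho and append the
    prediction action maximizing the expected immediate reward."""
    def pi_ir(b):
        a_n = pi_rho(b)
        a_p = int(np.argmax(R_IR @ b))   # argmax_{a_p} sum_s b(s) R_IR(s, a_p)
        return a_n, a_p
    return pi_ir

# Toy 2-state example with two tangents to the negative belief entropy:
gamma_rho = [np.log([0.3, 0.7]), np.log([0.7, 0.3])]
R_IR = reduce_rho_to_ir(gamma_rho)
pi_ir = reduce_policy_rho_to_ir(lambda b: 0, R_IR)   # pi_rho always picks normal action 0
print(pi_ir(np.array([0.2, 0.8])))   # (0, 0): prediction 0, whose tangent was drawn near b = [0.3, 0.7]
```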
The main implication of these theorems is that any result that holds for either ρPOMDP or POMDP-IR also holds for the other framework. For example, the results presented in Theorem 4.3 of Araya-López et al (2010), which bound the error in the value function of a ρPOMDP, also hold for POMDP-IR. Furthermore, with this equivalence, the computational complexity of solving ρPOMDP and POMDP-IR turns out to be the same, since a POMDP-IR can be converted into a ρPOMDP (and vice-versa) trivially, without any significant blow-up in representation. Although we have proved the equivalence of ρPOMDP and POMDP-IR only for pure active perception tasks, where the reward is conditioned solely on the belief, it is straightforward to extend it to hybrid active perception tasks, where the reward is conditioned on both the belief and the state. Although the resulting active perception POMDP for dynamic sensor selection is such that the action does not affect the state, the results in this section do not use that property at all and are thus valid for active perception POMDPs in which an agent might take an action that affects the state at the next time step.
The POMDP-IR framework enables us to formulate uncertainty reduction as an objective, but it does so at the cost of additional computation, as adding prediction actions enlarges the action space. The computational complexity of performing a point-based backup for solving POMDP-IR is $O(|S|^2|A_n||A_p||\Omega||\Gamma_{t-1}|) + O(|S||B||A_n||\Gamma_{t-1}||\Omega||A_p|)$. In this section, we present a new technique that exploits the independence properties of POMDP-IR, mainly that the transition function and the observation function are independent of the prediction actions, to reduce this computational cost. We also show that the same principle is applicable to ρPOMDPs.

The increased computational cost of solving POMDP-IR arises from the size of the action space, $|A_n||A_p|$. However, as shown in Figure 4, prediction actions only affect the reward function and normal actions only affect the observation and transition functions. We exploit this independence to decompose the maximization in the Bellman optimality equation:
$$
\begin{aligned}
V^*_t(b) &= \max_{\langle a_n, a_p \rangle \in A} \Big[ \sum_s b(s) R(s, a_p) + \sum_{z \in \Omega} \Pr(z|a_n, b)\, V^*_{t-1}(b^{a_n,z}) \Big] \\
&= \max_{a_p \in A_p} \sum_s b(s) R(s, a_p) + \max_{a_n \in A_n} \sum_{z \in \Omega} \Pr(z|a_n, b)\, V^*_{t-1}(b^{a_n,z}).
\end{aligned}
$$
This decomposition can be exploited by point-based methods by computing $\Gamma^{a,z}_t$ only for normal actions $a_n$ and $\alpha^{a_p}$ only for prediction actions. That is, (5) can be changed to:
$$\Gamma^{a_n,z}_t = \{\alpha^{a_n,z}_i : \alpha_i \in \Gamma_{t-1}\}. \quad (15)$$
For each prediction action, we compute the vector specifying the immediate reward for performing that prediction action in each state: $\Gamma_{A_p} = \{\alpha^{a_p}\}$, where $\alpha^{a_p}(s) = R(s, a_p)$ for all $a_p \in A_p$. The next step is to modify (6) to separately compute the vectors maximizing the expected reward induced by prediction actions and the expected return induced by the normal action:
$$\alpha^{a_n}_b = \operatorname*{argmax}_{\alpha^{a_p} \in \Gamma_{A_p}} \sum_s b(s)\alpha^{a_p}(s) + \sum_z \operatorname*{argmax}_{\alpha^{a_n,z} \in \Gamma^{a_n,z}_t} \sum_s \alpha^{a_n,z}(s) b(s).$$
By decomposing the maximization, this approach avoids iterating over all $|A_n||A_p|$ joint actions. At each timestep $t$, this approach generates $|A_n||\Omega||\Gamma_{t-1}| + |A_p|$ backprojections in $O(|S|^2|A_n||\Omega||\Gamma_{t-1}| + |S||A_p|)$ time and then prunes them to $|B|$ vectors, with a computational complexity of $O(|S||B|(|A_p| + |A_n||\Gamma_{t-1}||\Omega|))$.

The same principle can be applied to ρPOMDPs by changing (6) so that it maximizes over the immediate reward independently from the future return:
$$\alpha^{a}_b = \operatorname*{argmax}_{\alpha_\rho \in \Gamma_\rho} \sum_s b(s)\alpha_\rho(s) + \sum_z \operatorname*{argmax}_{\alpha^{a,z} \in \Gamma^{a,z}_t} \sum_s \alpha^{a,z}(s) b(s).$$
The computational complexity of solving a ρPOMDP with this approach is $O(|S|^2|A||\Omega||\Gamma_{t-1}| + |S||\Gamma_\rho|) + O(|S||B|(|\Gamma_\rho| + |A||\Gamma_{t-1}||\Omega|))$. Thus, even though both POMDP-IR and ρPOMDP use extra actions or vectors to formulate belief-based rewards, they can both be solved at only minimal additional computational cost.
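As an illustration of the decomposed backup (a minimal sketch under our own data-structure assumptions, not code from the article), the maximization over prediction actions and the per-observation maximization are carried out independently, so the $|A_n|\times|A_p|$ joint loop never appears:

```python
import numpy as np

def decomposed_backup_vector(b, Gamma_Ap, Gamma_anz):
    """Decomposed point-based backup for one normal action a_n (modified Eq. 6).

    Gamma_Ap  : list of vectors alpha^{a_p} with alpha^{a_p}(s) = R(s, a_p), one per prediction action
    Gamma_anz : dict mapping each observation z to the back-projected vectors Gamma^{a_n,z}_t (Eq. 15)
    """
    # argmax over prediction actions: best immediate-reward vector at belief b
    alpha = max(Gamma_Ap, key=lambda g: float(b @ g)).astype(float).copy()
    # argmax over back-projections, one per observation, independent of the prediction choice
    for z, candidates in Gamma_anz.items():
        alpha += max(candidates, key=lambda g: float(b @ g))
    return alpha   # alpha^{a_n}_b
```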
The previous sections allow us to model the active perception task efficiently, such that the PWLC property of the value function is maintained. Thus, we can now directly employ traditional POMDP solvers that exploit this property to compute the optimal value function $V^*_t$. While point-based methods scale better in the size of the state space, they are still not practical for our needs, as they do not scale in the size of the normal action space of active perception POMDPs.

While the computational complexity of one iteration of PBVI is linear in the size of the action space $|A|$ of a POMDP, for an active perception POMDP the action space is modeled as selecting $K$ out of the $N$ available sensors, yielding $|A| = \binom{N}{K}$. For fixed $K$, as the number of sensors $N$ grows, the size of the action space, and with it the computational cost of PBVI, grows exponentially, making traditional POMDP solvers infeasible for active perception POMDPs.

In this section, we propose greedy PBVI, a new point-based planner for solving active perception POMDPs that scales much better in the size of the action space. To facilitate the explication of greedy PBVI, we now present the final step of PBVI, described earlier in (7) and (8), in a different way. For each $b \in B$ and $a \in A$, we must find the best $\alpha^a_b \in \Gamma^a_t$:
$$\alpha^{a,*}_b = \operatorname*{argmax}_{\alpha^a_b \in \Gamma^a_t} \sum_s \alpha^a_b(s) b(s), \quad (16)$$
and simultaneously record its value $Q(b,a) = \sum_s \alpha^{a,*}_b(s) b(s)$. Then, for each $b$ we find the best vector across all actions:
$$\alpha_b = \alpha^{a^*}_b, \text{ where } a^* = \operatorname*{argmax}_{a \in A} Q(b,a). \quad (17)$$

The main idea of greedy PBVI is to exploit greedy maximization (Nemhauser et al, 1978), an algorithm that operates on a set function $Q: 2^X \to \mathbb{R}$. Greedy maximization is much faster than full maximization, as it avoids enumerating the $\binom{N}{K}$ choices and instead constructs a subset of $K$ elements iteratively. Thus, we replace the maximization operator in the Bellman optimality equation with greedy maximization. Algorithm 1 shows the argmax variant, which constructs a subset $Y \subseteq X$ of size $K$ by iteratively adding elements of $X$ to $Y$. At each iteration, it adds the element that maximizes the marginal gain $\Delta_Q(e|a)$ of adding a sensor $e$ to a subset of sensors $a$:
$$\Delta_Q(e|a) = Q(b, a \cup \{e\}) - Q(b, a). \quad (18)$$

Algorithm 1 greedy-argmax(Q, X, K)
  Y ← ∅
  for m = 1 to K do
    Y ← Y ∪ {argmax_{e ∈ X \ Y} Δ_Q(e|Y)}
  end for
  return Y

To exploit greedy maximization in PBVI, we need to replace an argmax over $A$ with greedy-argmax. Our alternative description of PBVI above makes this straightforward: (17) contains such an argmax, and $Q(b,\cdot)$ has been intentionally formulated as a set function over $A^+$. Thus, implementing greedy PBVI requires only replacing (17) with:
$$a^G = \text{greedy-argmax}(Q(b,\cdot), A^+, K). \quad (19)$$
Since the complexity of greedy-argmax is only $O(NK)$, the complexity of greedy PBVI for computing $\Gamma^a_t$ is only $O(|S||B|NK|\Gamma_{t-1}|)$, as compared to $O(|S||B|\binom{N}{K})$ for traditional PBVI.

Using point-based methods as a starting point is essential to our approach. Algorithms like Monahan's enumeration algorithm (Monahan, 1982), which rely on pruning operations to compute $V^*$ instead of performing an explicit argmax, cannot directly use greedy-argmax. Thus, it is precisely because PBVI operates on a finite set of beliefs that an explicit argmax is performed, opening the door to using greedy-argmax instead.
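The following Python sketch of Algorithm 1 (our own illustration; the toy coverage set function and its footprints are assumptions, not part of the article) shows greedy-argmax applied to a submodular set function. In greedy PBVI, the function passed in would be the set function $Q(b,\cdot)$ over sensor subsets for a fixed belief $b$:

```python
def greedy_argmax(Q, X, K):
    """Algorithm 1: build a subset Y of at most K elements of X, at each step adding the
    element with the largest marginal gain Delta_Q(e | Y) = Q(Y ∪ {e}) - Q(Y)."""
    Y = frozenset()
    for _ in range(K):
        gains = {e: Q(Y | {e}) - Q(Y) for e in X - Y}
        best = max(gains, key=gains.get)
        Y = Y | {best}
    return Y

# Toy submodular set function: coverage of sensor "footprints", N = 4 sensors, K = 2.
footprints = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 2, 3}}
Q = lambda a: len(set().union(*(footprints[i] for i in a))) if a else 0
print(greedy_argmax(Q, frozenset(footprints), 2))   # frozenset({2, 3}): covers {1, 2, 3, 4}
```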
7.1 Bounds given submodular value functions

In the following subsections, we present the highlights of the theoretical guarantees associated with greedy PBVI; the detailed analysis can be found in the appendix. Specifically, we show that a value function computed by greedy PBVI is guaranteed to have bounded error with respect to the optimal value function under submodularity, a property of set functions that formalizes the notion of diminishing returns. Then, we establish the conditions under which the value function of a POMDP is guaranteed to be submodular. We define $\rho(b)$ as the negative belief entropy, $\rho(b) = -H_b(s)$, to establish the submodularity of the value function. Both ρPOMDP and POMDP-IR approximate $\rho(b)$ with tangents. Thus, in the last subsection, we show that even if the belief entropy is approximated using tangents, the value function computed by greedy PBVI is guaranteed to have bounded error with respect to the optimal value function.

Submodularity is a property of set functions that corresponds to diminishing returns, i.e., adding an element to a set increases the value of the set function by a smaller or equal amount than adding that same element to a subset. In our notation, this is formalized as follows. Given a policy $\pi$, the set function $Q^\pi_t(b,a)$ is submodular in $a$ if, for every $a_M \subseteq a_N \subseteq A^+$ and $a_e \in A^+ \setminus a_N$,
$$\Delta_{Q_b}(a_e|a_M) \ge \Delta_{Q_b}(a_e|a_N). \quad (20)$$
Equivalently, $Q^\pi_t(b,a)$ is submodular if, for every $a_M, a_N \subseteq A^+$,
$$Q^\pi_t(b, a_M \cap a_N) + Q^\pi_t(b, a_M \cup a_N) \le Q^\pi_t(b, a_M) + Q^\pi_t(b, a_N).$$
Submodularity is an important property because of the following result:
Theorem 3 (Nemhauser et al, 1978) If $Q^\pi_t(b,a)$ is non-negative, monotone, and submodular in $a$, then for all $b$,
$$Q^\pi_t(b, a^G) \ge (1 - e^{-1})\, Q^\pi_t(b, a^*), \quad (21)$$
where $a^G = \text{greedy-argmax}(Q^\pi_t(b,\cdot), A^+, K)$ and $a^* = \operatorname*{argmax}_{a \in A} Q^\pi_t(b,a)$.

Theorem 3 gives a bound only for a single application of greedy-argmax, not for applying it within each backup, as greedy PBVI does. In this subsection, we establish such a bound. Let the greedy Bellman operator $B^G$ be:
$$(B^G V^\pi_{t-1})(b) = \max^G_a \Big[\rho(b,a) + \gamma \sum_{z \in \Omega} \Pr(z|a,b)\, V^\pi_{t-1}(b^{a,z})\Big],$$
where $\max^G_a$ refers to greedy maximization. This immediately implies the following corollary to Theorem 3:

Corollary 1 Given any policy $\pi$, if $Q^\pi_t(b,a)$ is non-negative, monotone, and submodular in $a$, then for all $b$,
$$(B^G V^\pi_{t-1})(b) \ge (1 - e^{-1})(B^* V^\pi_{t-1})(b). \quad (22)$$

Proof Follows from Theorem 3, since $(B^G V^\pi_{t-1})(b) = Q^\pi_t(b, a^G)$ and $(B^* V^\pi_{t-1})(b) = Q^\pi_t(b, a^*)$. □

Next, we define the greedy Bellman equation: $V^G_t(b) = (B^G V^G_{t-1})(b)$, where $V^G_0(b) = \rho(b)$. Note that $V^G_t$ is the true value function obtained by greedy maximization, without any point-based approximations. Using Corollary 1, we can bound the error of $V^G$ with respect to $V^*$.

Theorem 4 If, for all policies $\pi$, $Q^\pi_t(b,a)$ is non-negative, monotone, and submodular in $a$, then for all $b$,
$$V^G_t(b) \ge (1 - e^{-1})^t\, V^*_t(b). \quad (23)$$

Proof See Appendix.

Theorem 4 extends Nemhauser's result to a full sequential decision-making setting where multiple applications of greedy maximization are employed over multiple time steps. This theorem gives a theoretical guarantee on the performance of greedy PBVI: given a POMDP with a submodular value function, greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. Moreover, this performance comes at a computational cost that is much less than that of solving the same POMDP with traditional solvers. Thus, greedy PBVI scales much better in the size of the action space of active perception POMDPs, while still retaining bounded error.

The results presented in this subsection are applicable only if the value function of the POMDP is submodular. In the following subsections, we establish the submodularity of the value function for active perception POMDPs under certain conditions.

7.2 Submodularity of value functions

The previous subsection showed that the value function computed by greedy PBVI is guaranteed to have bounded error as long as it is non-negative, monotone, and submodular. In this subsection, we establish sufficient conditions for these properties to hold. Specifically, we show that, if the belief-based reward is the negative entropy, i.e., $\rho(b) = -H_b(s) + \log(|S|)$, then under certain conditions $Q^\pi_t(b,a)$ is submodular, non-negative, and monotone, as required by Theorem 4. We point out that the second term, $\log(|S|)$, is only required (and sufficient) to guarantee non-negativity, but is independent of the actual beliefs or actions. For conciseness, we omit this term in the remainder of this paper.

We start by observing that $Q^\pi_t(b,a) = \rho(b) + \sum_{k=1}^{t-1} G^\pi_k(b_t, a_t)$, where $G^\pi_k(b_t, a_t)$ is the expected immediate reward with $k$ steps to go, conditioned on the belief and action with $t$ steps to go and assuming policy $\pi$ is followed after timestep $t$:
$$G^\pi_k(b_t, a_t) = \gamma^{t-k} \sum_{z_{t:k}} \Pr(z_{t:k} \mid b_t, a_t, \pi)\,(-H_{b_k}(s_k)),$$
where $z_{t:k}$ is the vector of observations received in the interval from $t$ steps to go to $k$ steps to go, $b_t$ is the belief at $t$ steps to go, $a_t$ is the action taken at $t$ steps to go, and $\rho(b_k) = -H_{b_k}(s_k)$, where $s_k$ is the state at $k$ steps to go. To show that $Q^\pi_t(b,a)$ is submodular, the main condition is conditional independence, as defined below:
Definition 5 The observation set $z$ is conditionally independent given $s$ if any pair of observation features are conditionally independent given the state, i.e.,
$$\Pr(z_i, z_j \mid s) = \Pr(z_i \mid s)\Pr(z_j \mid s), \quad \forall z_i, z_j \in z. \quad (24)$$
Using the above definition, the submodularity of $Q^\pi_t(b,a)$ can be established as:

Theorem 5 If $z_{t:k}$ is conditionally independent given $s_k$ and $\rho(b) = -H_b(s)$, then $Q^\pi_t(b,a)$ is submodular in $a$, for all $\pi$.

Proof See Appendix.
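To make Definition 5 and the diminishing-returns property behind Theorem 5 concrete, the following small numerical check (our own illustrative sketch; the prior and sensor accuracies are arbitrary assumptions) computes the expected reduction in belief entropy for subsets of conditionally independent sensors and shows that the marginal gain of an extra sensor shrinks as the selected set grows:

```python
import numpy as np
from itertools import product

# 2 states, 3 conditionally independent binary sensors: Pr(z_i = s | s) = acc[i].
prior = np.array([0.6, 0.4])
acc = [0.9, 0.8, 0.7]

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_posterior_entropy(subset):
    """E_z[ H(b^z) ] for the sensors in `subset`, with Pr(z|s) = prod_i Pr(z_i|s)."""
    subset = sorted(subset)
    total = 0.0
    for z in product([0, 1], repeat=len(subset)):
        like = np.array([np.prod([acc[i] if z_i == s else 1 - acc[i]
                                  for i, z_i in zip(subset, z)]) for s in (0, 1)])
        joint = like * prior
        pz = joint.sum()
        total += pz * entropy(joint / pz)
    return total

def gain(subset):
    """Expected reduction in belief entropy: H(prior) - E_z[H(posterior)]."""
    return entropy(prior) - expected_posterior_entropy(subset)

# Diminishing returns (Eq. 20): sensor 2 helps less once more sensors are already selected.
print(gain({0, 2}) - gain({0}))        # marginal gain of sensor 2 given {0}
print(gain({0, 1, 2}) - gain({0, 1}))  # smaller (or equal) marginal gain given {0, 1}
```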
Theorem 6 If $z_{t:k}$ is conditionally independent given $s_k$, $V^\pi_t$ is convex over the belief space for all $t, \pi$, and $\rho(b) = -H_b(s) + \log(|S|)$, then for all $b$,
$$V^G_t(b) \ge (1 - e^{-1})^t\, V^*_t(b). \quad (25)$$

Proof See Appendix.

In this subsection we showed that, if the immediate belief-based reward $\rho(b)$ is defined as the negative belief entropy, then the value function of an active perception POMDP is guaranteed to be submodular under certain conditions. However, as mentioned earlier, to solve an active perception POMDP we approximate the belief entropy with vector tangents. This might interfere with the submodularity of the value function. In the next subsection, we show that, even though the PWLC approximation of the belief entropy might interfere with the submodularity of the value function, the value function computed by greedy PBVI is still guaranteed to have bounded error.

7.3 Bounds given approximated belief entropy

While Theorem 6 bounds the error in $V^G_t(b)$, it does so only on the condition that $\rho(b) = -H_b(s)$. However, as discussed earlier, our definition of active perception POMDPs instead defines $\rho$ using a set of vectors $\Gamma_\rho = \{\alpha^1_\rho, \dots, \alpha^m_\rho\}$, each of which is a tangent to $-H_b(s)$, as suggested by Araya-López et al (2010), in order to preserve the PWLC property. While this can interfere with the submodularity of $Q^\pi_t(b,a)$, here we show that the error generated by this approximation is still bounded.

Let $\tilde\rho(b)$ denote the PWLC-approximated entropy and $\tilde V^*_t$ denote the optimal value function when using a PWLC approximation to the negative entropy for the belief-based reward, as in an active perception POMDP, i.e.,
$$\tilde V^*_t(b) = \max_a \Big[\tilde\rho(b) + \sum_{z \in \Omega} \Pr(z|b,a)\, \tilde V^*_{t-1}(b^{a,z})\Big]. \quad (26)$$
Araya-López et al (2010) showed that, if $\rho(b)$ verifies the $\alpha$-Hölder condition (Gilbarg and Trudinger, 2001), a generalization of the Lipschitz condition, then the following relation holds between $V^*_t$ and $\tilde V^*_t$:
$$\|V^*_t - \tilde V^*_t\|_\infty \le \frac{C\delta^\alpha}{1 - \gamma}, \quad (27)$$
where $V^*_t$ is the optimal value function with $\rho(b) = -H_b(s)$, $\delta$ is the density of the set of belief points at which tangents are drawn to the belief entropy, and $C$ is a constant.

Let $\tilde V^G_t(b)$ be the value function computed by greedy PBVI when the immediate belief-based reward is $\tilde\rho(b)$:
$$\tilde V^G_t(b) = \max^G_a \Big[\tilde\rho(b) + \sum_{z \in \Omega} \Pr(z|b,a)\, \tilde V^G_{t-1}(b^{a,z})\Big]. \quad (28)$$
Then the error between $\tilde V^G_t(b)$ and $V^*_t(b)$ is bounded, as stated in the following theorem.

Theorem 7 For all beliefs, the error between $\tilde V^G_t(b)$ and $\tilde V^*_t(b)$ is bounded if $\rho(b) = -H_b(s)$, $V^\pi_t$ is convex in the belief space for all $\pi, t$, and $z_{t:k}$ is conditionally independent given $s_k$.

Proof See Appendix.

In this subsection we showed that, if the negative entropy is approximated using tangent vectors, greedy PBVI still computes a value function that has bounded error. In the next subsection we outline how greedy PBVI can be extended to general active perception tasks.

7.4 General Active Perception POMDPs

The results presented in this section apply to the active perception POMDP in which the evolution of the state over time is independent of the actions of the agent. Here, we outline how these results can be extended to general active perception POMDPs without many changes. The main application for such an extension is in tasks involving a mobile robot coordinating with sensors to intelligently take actions to perceive its environment. In such cases, the robot's actions, by causing it to move, can change the state of the world.

The algorithms we proposed can be extended to such settings by making small modifications to the greedy maximization operator. The greedy algorithm can be run for $K+1$ iterations, and in each iteration the algorithm would choose to add either a sensor (only if fewer than $K$ sensors have been selected) or a movement action (if none has been selected so far). Formally, using the work of Fisher et al (1978), which extends that of Nemhauser et al (1978) on submodularity to combinatorial structures such as matroids, the action space of a POMDP involving a mobile robot can be modeled as a partition matroid, and greedy maximization subject to matroid constraints (Fisher et al, 1978) can be used to maximize the value function approximately.

The guarantees associated with greedy maximization subject to matroid constraints (Fisher et al, 1978) can then be used to bound the error of greedy PBVI. However, deriving exact theoretical guarantees for greedy PBVI for such tasks is beyond the scope of this article. Assuming that the reward function is still defined as the negative belief entropy, the submodularity of such POMDPs still holds under the conditions mentioned in Section 7.2.

In this section, we presented greedy PBVI, which uses greedy maximization to improve scalability in the action space of an active perception POMDP. We also showed that, if the value function of an active perception POMDP is submodular, then greedy PBVI computes a value function that is guaranteed to have bounded error. We established that, if the belief-based reward is defined as the negative belief entropy, then the value function of an active perception POMDP is guaranteed to be submodular. We showed that, if the negative belief entropy is approximated by tangent vectors, as is required to solve active perception POMDPs efficiently, greedy PBVI still computes a value function that has bounded error. Finally, we outlined how greedy PBVI and the associated theoretical bounds can be extended to general active perception POMDPs.

8 Experiments

In this section, we present an analysis of the behavior and performance of belief-based rewards for active perception tasks, which is the main motivation of our work. We present the results of experiments designed to study the effect of the choice of prediction actions/tangents on performance, and to compare the costs and benefits of myopic versus non-myopic planning.
8 Experiments

In this section, we present an analysis of the behavior and performance of belief-based rewards for active perception tasks, which is the main motivation of our work. We present the results of experiments designed to study the effect of the choice of prediction actions/tangents on performance, and to compare the costs and benefits of myopic versus non-myopic planning. We consider the task of tracking people in a surveillance area with a multi-camera tracking system. The goal of the system is to select a subset of cameras so as to correctly predict the position of people in the surveillance area, based on the observations received from the selected cameras. In the following subsections, we present results on real data collected from a multi-camera system in a shopping mall, and we present experiments comparing the performance of greedy PBVI to PBVI.

We compare the performance of POMDP-IR with decomposed maximization to a naive POMDP-IR that does not decompose the maximization. Thanks to Theorems 1 and 2, these approaches have performance equivalent to their ρPOMDP counterparts. We also compare against two baselines. The first is a weak baseline we call the rotate policy, in which the agent simply keeps switching between cameras on a turn-by-turn basis. The second is a stronger baseline we call the coverage policy, which was developed in earlier work on active perception (Spaan, 2008; Spaan and Lima, 2009). The coverage policy is obtained by solving a POMDP that rewards the agent for observing the person, i.e., the agent is encouraged to select the cameras that are most likely to generate positive observations. Thanks to the decomposed maximization, the computational cost of solving for the coverage policy and for belief-based rewards is the same.

Fig. 5 Problem setup for the task of tracking one person. We model this task as a POMDP with one state for each cell. Thus the person can move among |S| cells. Each cell is adjacent to two other cells, and each cell is monitored by a single camera; thus, in this case there are N = |S| cameras. At each time step, the person stays in the same cell as in the previous time step with probability p, or moves to one of the neighboring cells with equal probability. The agent must select K out of N cameras, and the task is to predict the state of the person correctly using noisy observations from the K selected cameras. There is one prediction action for each state, and the agent gets a reward of +1 if it correctly predicts the state and 0 otherwise. An observation is a vector of N observation features, each of which specifies the person's position as estimated by the given camera. If a camera is not selected, then the corresponding observation feature has the value null.

8.1 Simulated Setting

We start with experiments conducted in a simulated setting, first considering the task of tracking a single person with a multi-camera system and then considering the more challenging task of tracking multiple people.

Fig. 6 (a) Performance comparison between POMDP-IR with decomposed maximization, naive POMDP-IR, the coverage policy, and the rotate policy; (b) runtime comparison between POMDP-IR with decomposed maximization and naive POMDP-IR; (c) behavior of the POMDP-IR policy; (d) behavior of the coverage policy.
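To make the setup of Figure 5 concrete, the following is a minimal sketch of the transition and observation model it describes; the self-transition probability p, the camera accuracy of 0.75, and all names are illustrative assumptions rather than the exact parameters of our experiments.

```python
import numpy as np

def transition_matrix(num_cells, p):
    """T[s, s']: stay in cell s with probability p, otherwise move to one of
    the two neighbouring cells (ring layout) with equal probability."""
    T = np.zeros((num_cells, num_cells))
    for s in range(num_cells):
        T[s, s] += p
        T[s, (s - 1) % num_cells] += (1 - p) / 2
        T[s, (s + 1) % num_cells] += (1 - p) / 2
    return T

def feature_prob(z_i, s, accuracy, num_cells):
    """P(a selected camera reports cell z_i | true cell s): correct with
    probability `accuracy`, otherwise uniform over the remaining cells."""
    return accuracy if z_i == s else (1 - accuracy) / (num_cells - 1)

def observation_prob(z, s, selected, accuracy=0.75):
    """Joint probability of observation vector z given state s; features of
    unselected cameras must be None (the null observation)."""
    num_cells = len(z)
    prob = 1.0
    for cam in range(num_cells):
        if cam in selected:
            prob *= feature_prob(z[cam], s, accuracy, num_cells)
        else:
            prob *= 1.0 if z[cam] is None else 0.0
    return prob

def belief_update(b, T, z, selected, accuracy=0.75):
    """Standard POMDP belief update: predict with T, weight by the
    observation likelihood, and renormalize."""
    b_pred = b @ T
    like = np.array([observation_prob(z, s, selected, accuracy)
                     for s in range(len(b))])
    post = b_pred * like
    return post / post.sum()
```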
We start by considering the task of tracking one person walking in a grid world composed of |S| cells and N cameras, as shown in Figure 5. At each timestep, the agent can select only K cameras, where K ≤ N. Each selected camera generates a noisy observation of the person's location. The agent's goal is to minimize its uncertainty about the person's state. In the experiments in this section, we fixed K = 1 and N = 10. The problem setup and the POMDP model are shown and described in Figure 5.

To compare the performance of POMDP-IR to the baselines, 100 trajectories were simulated from the POMDP. The agent was asked to guess the person's position at each time step. Figure 6(a) shows the cumulative reward collected by all four methods. POMDP-IR with decomposed maximization and naive POMDP-IR perform identically, as the lines indicating their respective performance lie on top of each other in Figure 6(a). However, Figure 6(b), which compares the runtimes of POMDP-IR with decomposed maximization and naive POMDP-IR, shows that decomposed maximization yields large computational savings. Figure 6(a) also shows that POMDP-IR greatly outperforms the rotate policy and modestly outperforms the coverage policy.

Figures 6(c) and 6(d) illustrate the qualitative difference between POMDP-IR and the coverage policy. The blue lines mark the points in the trajectory at which the agent selected the camera that observes the person's location; if the agent selected a camera that does not cover the person's location, no blue vertical line appears at that point in the trajectory. The agent has to select one out of N cameras and does not have the option of selecting no camera. The red line plots the max of the agent's belief. The main difference between the two policies is that once POMDP-IR gets a good estimate of the state, it proactively observes neighboring cells to which the person might transition. This helps it to more quickly find the person when she moves. By contrast, the coverage policy always looks at the cell where it believes her to be. Hence, it takes longer to find her again when she moves. This is evidenced by the fluctuations in the max of the belief, which often drops below 0.5 for the coverage policy but rarely does so for POMDP-IR. The presence of false positives and negatives can also be seen in the figure: the max of the belief sometimes goes down even though the agent selected the camera that can observe the person's location, and in some cases it shoots up even though the agent did not select that camera.

Fig. 7 Performance comparison as the negative belief entropy is better approximated.

Next, we examine the effect of approximating a true reward function like belief entropy with more and more tangents. Figure 3 illustrates how adding more tangents can better approximate negative belief entropy. To test the effects of this, we measured the cumulative reward when using between one and four tangents per state. Figure 7 shows the results and demonstrates that, as more tangents are added, the performance improves. However, performance also quickly saturates, as four tangents perform no better than three.
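For illustration, a minimal sketch of how a PWLC approximation of the negative belief entropy can be built from tangent vectors is given below; the choice of tangent beliefs and the two-state example are assumptions made for the sketch, not the construction used in our experiments.

```python
import numpy as np

def tangent_alpha(b_tilde, eps=1e-12):
    """Tangent (alpha) vector of rho(b) = -H_b(s) = sum_s b(s) log b(s),
    taken at the belief point b_tilde: alpha(s) = log b_tilde(s)."""
    return np.log(np.maximum(b_tilde, eps))

def rho_pwlc(b, tangent_points):
    """PWLC approximation of -H_b(s) from below: max over tangent vectors."""
    return max(float(np.dot(b, tangent_alpha(bt))) for bt in tangent_points)

# Two-state example: a few tangents approximate -H_b(s) from below.
tangents = [np.array([q, 1 - q]) for q in (0.1, 0.3, 0.5, 0.7, 0.9)]
b = np.array([0.8, 0.2])
exact = float(np.sum(b * np.log(b)))       # -H_b(s) = -0.5004 (nats)
print(round(rho_pwlc(b, tangents), 4), round(exact, 4))
```

Adding more tangent points tightens this lower bound, which is consistent with the saturation observed in Figure 7.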
Next, we compare the performance of POMDP-IR to a myopic variant that seeks only to maximize immediate reward, i.e., h = 1. We perform this comparison in three variants of the task. In the highly static variant, the state changes very slowly: the probability of staying in the same state is 0.9. In the moderately dynamic variant, the state changes more frequently, with a same-state transition probability of 0.7. In the highly dynamic variant, the state changes rapidly, with a same-state transition probability of 0.5. Figure 8 (top) shows the results of these comparisons. In each setting, non-myopic POMDP-IR outperforms myopic POMDP-IR. In the highly static variant, the difference is marginal. However, as the task becomes more dynamic, the importance of look-ahead planning grows. Because the myopic planner focuses only on immediate reward, it ignores what might happen to its belief when the state changes, which happens more often in dynamic settings.

Fig. 8 (top) Performance comparison for myopic vs. non-myopic policies; (bottom) performance comparison for myopic vs. non-myopic policies in the budget-based setting.

We also compare the performance of myopic and non-myopic planning in a budget-constrained environment. This corresponds to an energy-constrained setting, in which the cameras can be employed only a few times over the entire trajectory. This is augmented with resource constraints, so that the agent has to plan not only when to use the cameras, but also which camera to select. Specifically, the agent can employ the multi-camera system a total of only 15 times across all 50 timesteps, and at each of these 15 instances it can select which camera (out of the multi-camera system) to employ. On the other timesteps, it must select an action that generates only a null observation. Figure 8 (bottom) shows that non-myopic planning is of critical importance in this setting. Whereas myopic planning greedily consumes the budget as quickly as possible, thus earning more reward in the beginning, non-myopic planning saves the budget for situations in which it is highly uncertain about the state.

Finally, we compare the performance of myopic and non-myopic planning when the multi-camera system can communicate with a mobile robot that also has sensors. This setting is typical of a networked robot system (Spaan et al, 2010), in which a robot coordinates with a multi-camera system to perform surveillance of a building, detect emergency situations such as fire, or help people navigate to their destination. Here, the task is to minimize uncertainty about the location of one person who is moving in the space monitored by the robot and the cameras. The robot's sensors are assumed to be more accurate than the stationary cameras: the sensors attached to the robot can detect whether a person is in the robot's current cell with 90% accuracy, compared to 75% accuracy for each stationary camera in the cell it observes. The robot's sensor can observe the presence or absence of a person only in the cell that the robot occupies. In addition to using its sensors to generate observations about its current cell, the robot can also move forward or backward to an adjacent cell or choose to stay in the current cell. To model this task, the action vector introduced earlier is augmented with another action feature that indicates the direction of the robot's motion, which can take three values: forward, backward, or stay.

Fig. 9 Performance comparison for myopic vs. non-myopic policies when the camera system is assisting a mobile robot.

Performance is quantified as the total number of times the system correctly predicts the location of the person. Figure 9, which shows the performance of the myopic and non-myopic policies for this task, demonstrates that when planning non-myopically the agent is able to utilize the accurate sensors more effectively than when planning myopically.
To extend our analysis to a more challenging problem, we consider a simulated setting in which multiple people must be tracked simultaneously. Since |S| grows exponentially in the number of people, the resulting POMDP quickly becomes intractable. Therefore, we instead compute a factored value function
$$V_t(b) = \sum_i V^i_t(b^i), \qquad (29)$$
where $V^i_t(b^i)$ is the value of the agent's current belief $b^i$ about the $i$-th person. Thus, $V^i_t(b^i)$ needs to be computed only once, by solving a POMDP of the same size as that in the single-person setting. During action selection, $V_t(b)$ is computed using the current $b^i$ for each person. This kind of factorization corresponds to the assumption that each person's movements and observations are independent of those of the other people. Although violated in practice, this assumption can nonetheless yield good approximations. A sketch of action selection with this factored value function is given below.
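The following is a minimal sketch of action selection with the factored value function of (29); the per-person value function `V_single`, the belief list, and the candidate action set are placeholders introduced for illustration.

```python
# Sketch: evaluate a candidate camera subset with the factored value function
# V_t(b) = sum_i V_t^i(b^i) of (29). V_single(belief, action) stands in for the
# single-person value function (computed once); beliefs[i] is the current
# belief about person i.

from itertools import combinations

def factored_value(beliefs, action, V_single):
    return sum(V_single(b_i, action) for b_i in beliefs)

def select_cameras(beliefs, cameras, K, V_single):
    """Exhaustive maximization over K-subsets; greedy maximization could be
    substituted here exactly as in greedy PBVI."""
    return max(combinations(cameras, K),
               key=lambda subset: factored_value(beliefs, subset, V_single))
```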
Fig. 10 (top) Multi-person tracking performance for POMDP-IR and the coverage policy; (bottom) performance of POMDP-IR and the coverage policy when only important cells must be tracked.
Figure 10 (top), which compares POMDP-IR to the coverage policy with one, two, and three people, shows that the advantage of POMDP-IR grows substantially as the number of people increases. Whereas POMDP-IR tries to maintain a good estimate of everyone's position, the coverage policy just tries to look at the cells where the maximum number of people might be present, ignoring other cells completely.

Finally, we compare POMDP-IR and the coverage policy in a setting in which the goal is only to reduce uncertainty about a set of “important cells” that are a subset of the whole state space. For POMDP-IR, we prune the set of prediction actions to allow predictions only about important cells. For the coverage policy, we reward the agent only for observing people in important cells. The results, shown in Figure 10 (bottom), demonstrate that the advantage of POMDP-IR over the coverage policy is even larger in this variant of the task. POMDP-IR makes use of information coming from cells that neighbor the important cells (which is of critical importance if the important cells do not have good observability), while the coverage policy does not. As before, the difference gets larger as the number of people increases.

8.2 Real Data

Finally, we extended our analysis to a real-life dataset collected in a shopping mall. This dataset was gathered over 4 hours using 13 CCTV cameras located in the mall (Bouma et al, 2013). Each camera uses an FPDW (Dollar et al, 2010) pedestrian detector to detect people in each camera image, and in-camera tracking (Bouma et al, 2013) to generate tracks of the detected people's movements over time.

Fig. 11 Sample tracks for all the cameras. Each color represents all the tracks observed by a given camera. The boxes denote regions of high overlap between cameras.

Fig. 12 Performance of POMDP-IR and the coverage policy on the shopping mall dataset.

The dataset consists of 9915 tracks, each specifying one person's x-y position over time. Figure 11 shows sample tracks from all of the cameras.

To learn a POMDP model from the dataset, we divided the continuous space into 20 cells (|S| = 21: 20 cells plus an external state indicating that the person has left the shopping mall). Using the data, we learned a maximum-likelihood tabular transition function. However, we did not have access to the ground truth of the observed tracks, so we constructed it using the overlapping regions of the cameras. Because the cameras have many overlapping regions (see Figure 11), we were able to manually match tracks of the same person recorded individually by each camera. The “ground truth” was then constructed by taking a weighted mean of the matched tracks. Finally, this ground truth was used to estimate noise parameters for each cell (assuming zero-mean Gaussian noise), which were used as the observation function.
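A minimal sketch of the kind of model estimation just described—a tabular transition function from discretized tracks and a per-cell Gaussian noise estimate—is given below; the smoothing constant, data layout, and names are illustrative assumptions, not the exact procedure used in the article.

```python
import numpy as np

def ml_transition(tracks, num_states, smoothing=1e-3):
    """Tabular transition estimate from state-index tracks (maximum likelihood
    up to the small smoothing constant). Each track is a list of cell indices
    (0 .. num_states-1) over time."""
    counts = np.full((num_states, num_states), smoothing)
    for track in tracks:
        for s, s_next in zip(track[:-1], track[1:]):
            counts[s, s_next] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def per_cell_noise_std(residuals_by_cell):
    """residuals_by_cell maps a cell index to the list of (reported - ground
    truth) position errors observed in that cell; assuming zero-mean Gaussian
    noise, the ML parameter is the standard deviation about zero."""
    return {cell: float(np.sqrt(np.mean(np.square(r))))
            for cell, r in residuals_by_cell.items()}
```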
Fig. 13 Runtimes for the different methods.

Figure 12 shows that, as before, POMDP-IR substantially outperforms the coverage policy for various numbers of cameras. In addition to the reasons mentioned before, the high overlap between the cameras contributes to POMDP-IR's superior performance. The coverage policy has difficulty ascertaining people's exact locations because it is rewarded only for observing them somewhere in a camera's large overlapping region, whereas POMDP-IR is rewarded for deducing their exact locations.

8.3 Greedy PBVI

To empirically evaluate greedy PBVI, we tested it on the problem of tracking either one or multiple people using a multi-camera system. The reward function is described as a set of |S| vectors, $\Gamma_\rho = \{\alpha^1 \dots \alpha^{|S|}\}$, with $\alpha^i(s) = 1$ if $s = i$ and $\alpha^i(s) = 0$ otherwise. The initial belief is uniform across all states. We planned for horizon h = 10 with γ = 0. We compare against myopic versions of both greedy and regular PBVI, which compute a policy assuming h = 1 and use it at each timestep. Figure 13 shows runtimes under different values of N and K. Since multi-person tracking uses the value function obtained by solving a single-person POMDP, single-person and multi-person tracking have the same runtimes. These results demonstrate that greedy PBVI requires only a fraction of the computational cost of regular PBVI. In addition, the difference in runtime grows quickly as the action space gets larger: for N = 5 and K = 2, greedy PBVI is twice as fast, while for N = 11 and K = 3 it is approximately nine times as fast. Thus, greedy PBVI enables much better scalability in the action space. Figure 14, which shows the cumulative reward under different values of N and K for single-person (top) and multi-person (bottom) tracking, verifies that greedy PBVI's speedup does not come at the expense of performance, as greedy PBVI accumulates nearly as much reward as regular PBVI.
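To see why the gap grows with the action space, a quick back-of-the-envelope comparison (an illustration, not a complexity analysis from the article): full maximization must evaluate every K-subset of the N cameras, while greedy maximization evaluates at most N candidates in each of its K iterations. Actual runtimes, as in Figure 13, also depend on the cost of each evaluation.

```python
from math import comb

# Candidate evaluations per backup: all K-subsets vs. greedy's at-most N*K.
for N, K in [(5, 2), (11, 2), (13, 2), (7, 3), (9, 3), (11, 3)]:
    print(f"N={N:2d} K={K}: full max evaluates {comb(N, K):4d} subsets, "
          f"greedy at most {N * K:3d} marginal gains")
```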
Fig. 14 Cumulative reward for single-person (top) and multi-person (bottom) tracking.

The results also show that both PBVI and greedy PBVI benefit from non-myopic planning. While the performance advantage of non-myopic planning is relatively modest, it increases with the number of cameras and people, which suggests that non-myopic planning is important for making active perception scalable.

Furthermore, an analysis of the resulting policies showed that myopic and non-myopic policies differ qualitatively. A myopic policy, in order to minimize uncertainty in the next step, tends to look where it believes the person to be. By contrast, a non-myopic policy tends to proactively look where the person might go next, so as to more quickly detect her new location when she moves. Consequently, non-myopic policies exhibit less fluctuation in belief and accumulate more reward, as illustrated in Figure 15. The blue lines mark when the agent chooses the camera that can observe the cell occupied by the person. The red line plots the max of the agent's belief. The difference in fluctuation in belief is evident, as the max of the belief often drops below 0.5 for the myopic policy but rarely does so for the non-myopic policy.
Fig. 15 Behavior of the myopic vs. the non-myopic policy.

9 Conclusions & Future Work

In this article, we addressed the problem of active perception, in which an agent must take actions to reduce uncertainty about a hidden variable while reasoning about various constraints. Specifically, we modeled the task of surveillance with multi-camera tracking systems in large urban spaces as an active perception task. Since the state of the environment is dynamic, we model this task as a POMDP in order to compute closed-loop, non-myopic policies that can reason about the long-term consequences of selecting a subset of sensors.

Formulating uncertainty reduction as an end in itself is a challenging task, as it breaks the PWLC property of the value function, which is imperative for solving POMDPs efficiently. ρPOMDP and POMDP-IR are two frameworks that allow formulating uncertainty reduction as an end in itself and do not break the PWLC property. We showed that ρPOMDP and POMDP-IR are equivalent frameworks for modeling active perception tasks; thus, results that apply to one framework are also applicable to the other. While ρPOMDP does not restrict the definition of ρ to a PWLC function, in this work we restrict ourselves to the case where ρ is approximated with a PWLC function, as it is not feasible to efficiently solve a ρPOMDP in which ρ is not a PWLC function.

We model the action space of the active perception POMDP as selecting K out of N sensors, where K is the maximum number of sensors allowed by the resource constraints. Recent POMDP solvers enable scalability in the state space; however, for active perception, as the number of sensors grows, the action space grows exponentially. We proposed greedy PBVI, a POMDP planning method that improves scalability in the action space of a POMDP. While we do not directly address scaling in the observation space, we believe recent ideas on factorization of the observation space (Veiga et al, 2014) can be combined with our approach to improve scalability in the state, action, and observation spaces for solving active perception POMDPs.

By leveraging the theory of submodularity, we showed that the value function computed by greedy PBVI is guaranteed to have bounded error. Specifically, we extended Nemhauser's result on greedy maximization of submodular functions to long-term planning. To apply these results to the active perception task, we showed that under certain conditions the value function of an active perception POMDP is submodular. One such condition requires that the series of future observations be independent of each other given the state. While this is a strong condition, it is only a sufficient condition and may not be a necessary one. Thus, one line of future work is to attempt to relax this condition for proving the submodularity of the value function. Finally, we showed that, even with a PWLC approximation to the true value function, which is submodular, the error in the value function computed by greedy PBVI remains bounded, thus enabling us to efficiently compute value functions for active perception POMDPs.

Greedy PBVI is ideally suited for active perception POMDPs for which the value function is submodular. However, in real-life situations submodularity of the value function might not always hold. For example, in our setting, when there is occlusion it is possible for combinations of sensors, when selected together, to yield higher utility than the sum of their utilities when selected individually. A similar case can arise when a mobile robot is trying to find the best viewpoint from which to observe a scene that is occluded. In such cases, greedy PBVI might not return the best solution.

Our empirical analysis established the critical factors involved in the performance of active perception tasks.
We showed that a belief-based formulation of uncertainty reduction beats a corresponding popular state-based reward baseline as well as other simple policies. While the non-myopic policy beats the myopic one, in certain cases the gain is marginal; however, in cases involving mobile sensors and budget constraints, non-myopic policies become critically important. Finally, experiments on a real-world dataset showed that the performance of greedy PBVI is similar to that of existing methods but requires only a fraction of the computational cost, leading to much better scalability for solving active perception tasks.
10 Appendix
Theorem 1
Let $M_\rho$ be a ρPOMDP and $\pi_\rho$ an arbitrary policy for $M_\rho$. Furthermore, let $M_{IR}$ = reduce-pomdp-ρ-IR($M_\rho$) and $\pi_{IR}$ = reduce-policy-ρ-IR($\pi_\rho$). Then, for all $b$,
$$V^{IR}_t(b) = V^\rho_t(b), \qquad (30)$$
where $V^{IR}_t$ is the $t$-step value function for $\pi_{IR}$ and $V^\rho_t$ is the $t$-step value function for $\pi_\rho$.

Proof By induction on $t$. To prove the base case, we observe that, from the definition of $\rho(b)$,
$$V^\rho_0(b) = \rho(b) = \max_{\alpha^{a_p}_\rho \in \Gamma_\rho} \sum_s b(s)\,\alpha^{a_p}_\rho(s).$$
Since $M_{IR}$ has a prediction action corresponding to each $\alpha^{a_p}_\rho$, the $a_p$ corresponding to $\alpha = \operatorname{argmax}_{\alpha^{a_p}_\rho \in \Gamma_\rho} \sum_s b(s)\,\alpha^{a_p}_\rho(s)$ must also maximize $\sum_s b(s)\,R(s, a_p)$. Then,
$$V^\rho_0(b) = \max_{a_p} \sum_s b(s)\,R_{IR}(s, a_p) = V^{IR}_0(b). \qquad (31)$$
For the inductive step, we assume that $V^{IR}_{t-1}(b) = V^\rho_{t-1}(b)$ and must show that $V^{IR}_t(b) = V^\rho_t(b)$. Starting with $V^{IR}_t(b)$,
$$V^{IR}_t(b) = \max_{a_p} \sum_s b(s)\,R(s, a_p) + \sum_z \Pr(z \mid b, \pi^n_{IR}(b))\, V^{IR}_{t-1}(b^{\pi^n_{IR}(b), z}), \qquad (32)$$
where $\pi^n_{IR}(b)$ denotes the normal action of the tuple specified by $\pi_{IR}(b)$ and
$$\Pr(z \mid b, \pi^n_{IR}(b)) = \sum_s \sum_{s''} O_{IR}(s'', \pi^n_{IR}(b), z)\, T_{IR}(s, \pi^n_{IR}(b), s'')\, b(s).$$
Using the reduction procedure, we can replace $T_{IR}$, $O_{IR}$ and $\pi^n_{IR}(b)$ with their ρPOMDP counterparts on the right-hand side of the above equation:
$$\Pr(z \mid b, \pi^n_{IR}(b)) = \sum_s \sum_{s''} O_\rho(s'', \pi_\rho(b), z)\, T_\rho(s, \pi_\rho(b), s'')\, b(s) = \Pr(z \mid b, \pi_\rho(b)).$$
Similarly, for the belief update equation,
$$b^{\pi^n_{IR}(b), z}(s') = \frac{O_{IR}(s', \pi^n_{IR}(b), z)}{\Pr(z \mid \pi^n_{IR}(b), b)} \sum_s b(s)\, T_{IR}(s, \pi^n_{IR}(b), s') = \frac{O_\rho(s', \pi_\rho(b), z)}{\Pr(z \mid \pi_\rho(b), b)} \sum_s b(s)\, T_\rho(s, \pi_\rho(b), s') = b^{\pi_\rho(b), z}(s'). \qquad (33)$$
Substituting the above result in (32) yields:
$$V^{IR}_t(b) = \max_{a_p} \sum_s b(s)\,R(s, a_p) + \sum_z \Pr(z \mid b, \pi_\rho(b))\, V^{IR}_{t-1}(b^{\pi_\rho(b), z}). \qquad (34)$$
Since the inductive assumption tells us that $V^{IR}_{t-1}(b) = V^\rho_{t-1}(b)$ and (31) shows that $\rho(b) = \max_{a_p} \sum_s b(s)\,R(s, a_p)$:
$$V^{IR}_t(b) = \rho(b) + \sum_z \Pr(z \mid b, \pi_\rho(b))\, V^\rho_{t-1}(b^{\pi_\rho(b), z}) = V^\rho_t(b). \qquad (35) \qquad ⊓⊔$$

Theorem 2
Let $M_{IR}$ be a POMDP-IR and $\pi_{IR} = \langle a_n, a_p \rangle$ a policy for $M_{IR}$, such that $a_p = \operatorname{argmax}_{a'_p} \sum_s b(s)\,R(s, a'_p)$. Furthermore, let $M_\rho$ = reduce-pomdp-IR-ρ($M_{IR}$) and $\pi_\rho$ = reduce-policy-IR-ρ($\pi_{IR}$). Then, for all $b$,
$$V^\rho_t(b) = V^{IR}_t(b), \qquad (36)$$
where $V^{IR}_t$ is the value of following $\pi_{IR}$ in $M_{IR}$ and $V^\rho_t$ is the value of following $\pi_\rho$ in $M_\rho$.

Proof By induction on $t$. To prove the base case, we observe that, from the definition of $\rho(b)$,
$$V^{IR}_0(b) = \max_{a_p} \sum_s b(s)\,R(s, a_p) = \sum_s b(s)\,\alpha(s) = \rho(b) = V^\rho_0(b), \qquad (37)$$
where $\alpha$ is the vector corresponding to $a_p = \operatorname{argmax}_{a'_p} \sum_s b(s)\,R(s, a'_p)$.
For the inductive step, we assume that $V^\rho_{t-1}(b) = V^{IR}_{t-1}(b)$ and must show that $V^\rho_t(b) = V^{IR}_t(b)$. Starting with $V^\rho_t(b)$,
$$V^\rho_t(b) = \rho(b) + \sum_z \Pr(z \mid b, \pi_\rho(b))\, V^\rho_{t-1}(b^{\pi_\rho(b), z}), \qquad (38)$$
where $\pi^n_{IR}(b)$ denotes the normal action of the tuple specified by $\pi_{IR}(b)$ and
$$\Pr(z \mid b, \pi_\rho(b)) = \sum_s \sum_{s''} O_\rho(s'', \pi_\rho(b), z)\, T_\rho(s, \pi_\rho(b), s'')\, b(s). \qquad (39)$$
From the reduction procedure, we can replace $T_\rho$, $O_\rho$ and $\pi_\rho(b)$ with their POMDP-IR counterparts:
$$\Pr(z \mid b, \pi_\rho(b)) = \sum_s \sum_{s''} O_{IR}(s'', \pi^n_{IR}(b), z)\, T_{IR}(s, \pi^n_{IR}(b), s'')\, b(s) = \Pr(z \mid b, \pi_{IR}(b)).$$
Similarly, for the belief update equation,
$$b^{\pi_\rho(b), z}(s') = \frac{O_\rho(s', \pi_\rho(b), z)}{\Pr(z \mid \pi_\rho(b), b)} \sum_s b(s)\, T_\rho(s, \pi_\rho(b), s') = \frac{O_{IR}(s', \pi^n_{IR}(b), z)}{\Pr(z \mid \pi^n_{IR}(b), b)} \sum_s b(s)\, T_{IR}(s, \pi^n_{IR}(b), s') = b^{\pi_{IR}(b), z}(s'). \qquad (40)$$
Substituting the above result in (38) yields:
$$V^\rho_t(b) = \rho(b) + \sum_z \Pr(z \mid b, \pi_{IR}(b))\, V^{IR}_{t-1}(b^{\pi_{IR}(b), z}). \qquad (41)$$
Since the inductive assumption tells us that $V^\rho_{t-1}(b) = V^{IR}_{t-1}(b)$ and (37) shows that $\max_{a_p} \sum_s b(s)\,R(s, a_p) = \rho(b)$:
$$V^\rho_t(b) = \max_{a_p} \sum_s b(s)\,R(s, a_p) + \sum_z \Pr(z \mid b, \pi_{IR}(b))\, V^{IR}_{t-1}(b^{\pi_{IR}(b), z}) = V^{IR}_t(b). \qquad ⊓⊔$$
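The reduction underlying Theorems 1 and 2 can be illustrated with a small sketch: each vector in $\Gamma_\rho$ becomes the state-dependent reward of one prediction action in the POMDP-IR, so the maximizing prediction action recovers $\rho(b)$ for every belief. The data structures and the two-state tangent example below are illustrative assumptions, not the reduce-pomdp procedure verbatim.

```python
import numpy as np

# Sketch of the rho-POMDP -> POMDP-IR direction: every alpha vector in
# Gamma_rho induces one prediction action whose reward is R(s, a_p) = alpha(s),
# so that max_{a_p} sum_s b(s) R(s, a_p) = rho(b) for every belief b.

Gamma_rho = [np.log(np.array([q, 1 - q])) for q in (0.25, 0.5, 0.75)]  # tangents of -H_b(s)
R_IR = {f"predict_{i}": alpha for i, alpha in enumerate(Gamma_rho)}    # prediction-action rewards

def rho(b):                        # PWLC belief-based reward of the rho-POMDP
    return max(float(b @ alpha) for alpha in Gamma_rho)

def immediate_reward_IR(b):        # best prediction action in the POMDP-IR
    return max(float(b @ r) for r in R_IR.values())

b = np.array([0.6, 0.4])
assert np.isclose(rho(b), immediate_reward_IR(b))   # base case of Theorem 1
```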
Lemma 1 If for all $b$, $\rho(b) \geq 0$,
$$V^\pi_t(b) \geq (1 - \epsilon)\, V^*_t(b), \qquad (42)$$
and $Q^\pi_t(b, a)$ is non-negative, monotone, and submodular in $a$, then, for $\epsilon \in [0, 1]$,
$$(B^G V^\pi_t)(b) \geq (1 - e^{-1})(1 - \epsilon)\,(B^G V^*_t)(b). \qquad (43)$$

Proof
Starting from (42), for a given $a$, multiplying both sides by $\gamma \geq 0$, taking the expectation over $z$, and adding $\rho(b)$ (since $\rho(b) \geq 0$ and $\epsilon \leq 1$):
$$\rho(b) + \gamma\, \mathbb{E}_{z \mid b, a}[V^\pi_t(b^{a, z})] \geq (1 - \epsilon)\big(\rho(b) + \gamma\, \mathbb{E}_{z \mid b, a}[V^*_t(b^{a, z})]\big).$$
From the definition of $Q^\pi_t$ (3), we thus have:
$$Q^\pi_{t+1}(b, a) \geq (1 - \epsilon)\, Q^*_{t+1}(b, a) \quad \forall a. \qquad (44)$$
From Theorem 3, we know
$$Q^\pi_{t+1}(b, a^G_\pi) \geq (1 - e^{-1})\, Q^\pi_{t+1}(b, a^*_\pi), \qquad (45)$$
where $a^G_\pi$ = greedy-argmax$(Q^\pi_{t+1}(b, \cdot), A^+, K)$ and $a^*_\pi = \operatorname{argmax}_a Q^\pi_{t+1}(b, a)$. Since $Q^\pi_{t+1}(b, a^*_\pi) \geq Q^\pi_{t+1}(b, a)$ for any $a$,
$$Q^\pi_{t+1}(b, a^G_\pi) \geq (1 - e^{-1})\, Q^\pi_{t+1}(b, a^{G*}), \qquad (46)$$
where $a^{G*}$ = greedy-argmax$(Q^*_t(b, \cdot), A^+, K)$. Finally, (44) implies that $Q^\pi_{t+1}(b, a^{G*}) \geq (1 - \epsilon)\, Q^*_{t+1}(b, a^{G*})$, so:
$$Q^\pi_{t+1}(b, a^G_\pi) \geq (1 - e^{-1})(1 - \epsilon)\, Q^*_{t+1}(b, a^{G*}),$$
$$(B^G V^\pi_t)(b) \geq (1 - e^{-1})(1 - \epsilon)\,(B^G V^*_t)(b). \qquad (47) \qquad ⊓⊔$$

Using Corollary 1 and Lemma 1, we can prove Theorem 4.
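As a toy numerical illustration of the greedy guarantee invoked in (45) (Theorem 3; Nemhauser et al, 1978), the snippet below compares greedy and exhaustive maximization of a small monotone submodular coverage function; the weights and sensor footprints are invented solely for illustration.

```python
from itertools import combinations

# Monotone submodular set function: weighted coverage of ground elements.
weights = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 1.0}
sensors = {1: {"a"}, 2: {"a", "b"}, 3: {"b", "c"}, 4: {"c", "d"}}

def f(subset):
    covered = set().union(*(sensors[i] for i in subset)) if subset else set()
    return sum(weights[e] for e in covered)

def greedy(K):
    chosen = []
    for _ in range(K):
        best = max((i for i in sensors if i not in chosen),
                   key=lambda i: f(chosen + [i]) - f(chosen))
        chosen.append(best)
    return chosen

K = 2
opt = max(f(list(c)) for c in combinations(sensors, K))
val = f(greedy(K))
print(val, opt, val >= (1 - 1 / 2.718281828) * opt)   # greedy within (1 - 1/e) of optimal
```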
Theorem 4
If for all policies $\pi$, $Q^\pi_t(b, a)$ is non-negative, monotone and submodular in $a$, then for all $b$,
$$V^G_t(b) \geq (1 - e^{-1})^t\, V^*_t(b). \qquad (48)$$

Proof
By induction on $t$. The base case, $t = 0$, holds because $V^G_0(b) = \rho(b) = V^*_0(b)$.
In the inductive step, for all $b$, we assume that
$$V^G_{t-1}(b) \geq (1 - e^{-1})^{t-1}\, V^*_{t-1}(b), \qquad (49)$$
and must show that
$$V^G_t(b) \geq (1 - e^{-1})^t\, V^*_t(b). \qquad (50)$$
Applying Lemma 1 with $V^\pi_t = V^G_{t-1}$ and $(1 - \epsilon) = (1 - e^{-1})^{t-1}$ to (49):
$$(B^G V^G_{t-1})(b) \geq (1 - e^{-1})^{t-1}(1 - e^{-1})\,(B^G V^*_{t-1})(b),$$
$$V^G_t(b) \geq (1 - e^{-1})^{t-1}\,(B^G V^*_{t-1})(b).$$
Now applying Corollary 1 with $V^\pi_{t-1} = V^*_{t-1}$:
$$V^G_t(b) \geq (1 - e^{-1})^{t-1}(1 - e^{-1})\,(B^* V^*_{t-1})(b),$$
$$V^G_t(b) \geq (1 - e^{-1})^t\, V^*_t(b). \qquad (51) \qquad ⊓⊔$$

Proving that $Q^\pi_t(b, a)$ is submodular in $a$ requires three steps. First, we show that $G^\pi_k(b_t, a_t)$ equals the conditional entropy of $b_k$ over $s_k$ given $z_{t:k}$ and $a_t$. Second, we show that, under certain conditions, conditional entropy is a submodular set function. Third, we combine these two results to show that $Q^\pi_t(b, a)$ is submodular.

Lemma 2 If $\rho(b) = -H_b(s)$, then the expected reward at each time step equals the negative discounted conditional entropy of $b_k$ over $s_k$ given $z_{t:k}$:
$$G^\pi_k(b_t, a_t) = -\gamma^{t-k}\big(H_{b_k}(s_k \mid z_{t:k}, a_t)\big) = -\gamma^{t-k}\big(H^{a_t}_{b_k}(s_k \mid z_{t:k})\big) \quad \forall \pi.$$

Proof
To prove the above lemma, we introduce some additional notation and definitions. First, we elaborate on the definition of $b_k$:
$$b_k(s_k) \triangleq \Pr(s_k \mid b_t, a_t, \pi, z_{t:k}) = \frac{\Pr(z_{t:k}, s_k \mid b_t, a_t, \pi)}{\Pr(z_{t:k} \mid b_t, a_t, \pi)}. \qquad (52)$$
For notational convenience, we also write this as:
$$b_k(s_k) \triangleq \frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}. \qquad (53)$$
The entropy of $b_k$ is thus:
$$H_{b_k}(s_k) = -\sum_{s_k} \frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})} \log\Big(\frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}\Big),$$
and the conditional entropy of $b_k$ over $s_k$ given $z_{t:k}$ is:
$$H^{a_t}_{b_k}(s_k \mid z_{t:k}) = -\sum_{s_k} \sum_{z_{t:k}} \Pr^\pi_{b_t, a_t}(z_{t:k}, s_k) \log\Big(\frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}\Big).$$
Then, by definition of $G^\pi_k(b_t, a_t)$,
$$G^\pi_k(b_t, a_t) = \gamma^{t-k}\Big(-\sum_{z_{t:k}} \Pr^\pi_{b_t, a_t}(z_{t:k})\, H_{b_k}(s_k)\Big).$$
By definition of entropy,
$$= \gamma^{t-k} \sum_{z_{t:k}} \Pr^\pi_{b_t, a_t}(z_{t:k}) \Big[\sum_{s_k} \frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})} \log\Big(\frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}\Big)\Big]$$
$$= \gamma^{t-k} \sum_{z_{t:k}} \Big[\sum_{s_k} \Pr^\pi_{b_t, a_t}(z_{t:k}, s_k) \log\Big(\frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}\Big)\Big].$$
By definition of conditional entropy,
$$= \gamma^{t-k}\big(-H^{a_t}_{b_k}(s_k \mid z_{t:k})\big). \qquad ⊓⊔$$

Lemma 3 If $z$ is conditionally independent given $s$, then $-H(s \mid z)$ is submodular in $z$, i.e., for any two observations $z_M$ and $z_N$,
$$H(s \mid z_M \cup z_N) + H(s \mid z_M \cap z_N) \geq H(s \mid z_M) + H(s \mid z_N). \qquad (54)$$

Proof
By Bayes' rule for conditional entropy (Cover and Thomas, 1991):
$$H(s \mid z_M \cup z_N) = H(z_M \cup z_N \mid s) + H(s) - H(z_M \cup z_N). \qquad (55)$$
Using conditional independence, we know $H(z_M \cup z_N \mid s) = H(z_M \mid s) + H(z_N \mid s)$. Substituting this in (55), we get:
$$H(s \mid z_M \cup z_N) = H(z_M \mid s) + H(z_N \mid s) + H(s) - H(z_M \cup z_N). \qquad (56)$$
By Bayes' rule for conditional entropy:
$$H(s \mid z_M \cap z_N) = H(z_M \cap z_N \mid s) + H(s) - H(z_M \cap z_N). \qquad (57)$$
Adding (56) and (57):
$$H(s \mid z_M \cap z_N) + H(s \mid z_M \cup z_N) = H(z_M \mid s) + H(z_N \mid s) + H(z_M \cap z_N \mid s) + 2H(s) - H(z_M \cup z_N) - H(z_M \cap z_N). \qquad (58)$$
By Bayes' rule for conditional entropy:
$$H(z_M \mid s) = H(s \mid z_M) + H(z_M) - H(s), \quad \text{and} \quad H(z_N \mid s) = H(s \mid z_N) + H(z_N) - H(s). \qquad (59)$$
Substituting $H(z_M \mid s)$ and $H(z_N \mid s)$ in (58):
$$H(s \mid z_M \cap z_N) + H(s \mid z_M \cup z_N) = H(s \mid z_M) + H(s \mid z_N) + H(z_M \cap z_N \mid s) + \big[H(z_M) + H(z_N) - H(z_M \cup z_N) - H(z_M \cap z_N)\big].$$
Since entropy is submodular, $[H(z_M) + H(z_N) - H(z_M \cup z_N) - H(z_M \cap z_N)]$ is positive, and since entropy is positive, $H(z_M \cap z_N \mid s)$ is positive. Thus,
$$H(s \mid z_M \cap z_N) + H(s \mid z_M \cup z_N) = H(s \mid z_M) + H(s \mid z_N) + \text{a positive term}.$$
This implies $H(s \mid z_M \cup z_N) + H(s \mid z_M \cap z_N) \geq H(s \mid z_M) + H(s \mid z_N)$. ⊓⊔

Lemma 4 If $z_{t:k}$ is conditionally independent given $s_k$ and $\rho(b) = -H_b(s)$, then $G^\pi_k(b_t, a_t)$ is submodular in $a_t$ for all $\pi$.

Proof Let $a_{t,M}$ and $a_{t,N}$ be two actions and $z_{t:k,M}$ and $z_{t:k,N}$ the observations they induce. Then, from Lemma 2,
$$G^\pi_k(b_t, a_{t,M}) = \gamma^{t-k}\big(-H^{a_t}_{b_k}(s_k \mid z_{t:k,M})\big). \qquad (60)$$
From Lemma 3,
$$H^{a_t}_{b_k}(s_k \mid z_{t:k,M} \cup z_{t:k,N}) + H^{a_t}_{b_k}(s_k \mid z_{t:k,M} \cap z_{t:k,N}) \geq H^{a_t}_{b_k}(s_k \mid z_{t:k,M}) + H^{a_t}_{b_k}(s_k \mid z_{t:k,N}).$$
Multiplying both sides by $-\gamma^{t-k}$ and using the definition of $G$,
$$G^\pi_k(b_t, a_{t,M} \cup a_{t,N}) + G^\pi_k(b_t, a_{t,N} \cap a_{t,M}) \leq G^\pi_k(b_t, a_{t,M}) + G^\pi_k(b_t, a_{t,N}). \qquad ⊓⊔$$

Theorem 5 If $z_{t:k}$ is conditionally independent given $s_k$ and $\rho(b) = -H_b(s)$, then $Q^\pi_t(b, a)$ is submodular in $a$, for all $\pi$.

Proof $\rho(b)$ is trivially submodular in $a$ because it is independent of $a$. Furthermore, Lemma 4 shows that $G^\pi_k(b_t, a_t)$ is submodular in $a_t$. Since a positively weighted sum of submodular functions is also submodular (Krause and Golovin, 2014), this implies that $\sum_{k=1}^{t-1} G^\pi_k(b_t, a_t)$ and thus $Q^\pi_t(b, a)$ are also submodular in $a$. ⊓⊔

Lemma 5 If $V^\pi_t$ is convex over the belief space for all $t$, then $Q^\pi_t(b, a)$ is monotone in $a$, i.e., for all $b$ and $a_M \subseteq a_N$, $Q^\pi_t(b, a_M) \leq Q^\pi_t(b, a_N)$.

Proof By definition of $Q^\pi_t(b, a)$,
$$Q^\pi_t(b, a_M) = \rho(b) + \gamma\, \mathbb{E}_{z_M}[V^\pi_{t-1}(b^{a_M, z_M}) \mid b, a_M]. \qquad (61)$$
Since $\rho(b)$ is independent of $a_M$, we need only show that the second term is monotone in $a$. Let $a_P = a_N \setminus a_M$ and
$$F^\pi_b(a_N) = \mathbb{E}_{z_N}[V^\pi_{t-1}(b^{a_N, z_N}) \mid b, a_N]. \qquad (62)$$
Since $a_N = \{a_M \cup a_P\}$,
$$F^\pi_b(a_N) = \mathbb{E}_{\{z_M, z_P\}}[V^\pi_{t-1}(b^{\{a_M, a_P\}, \{z_M, z_P\}}) \mid b, \{a_M, a_P\}].$$
Separating expectations,
$$F^\pi_b(a_N) = \mathbb{E}_{z_M}\big[\mathbb{E}_{z_P}[V^\pi_{t-1}(b^{\{a_M, a_P\}, \{z_M, z_P\}}) \mid b, a_P] \mid b, a_M\big].$$
Applying Jensen's inequality, since $V^\pi_{t-1}$ is convex,
$$F^\pi_b(a_N) \geq \mathbb{E}_{z_M}\big[V^\pi_{t-1}\big(\mathbb{E}_{z_P}[b^{a_M, a_P, z_M, z_P} \mid b, a_P]\big) \mid b, a_M\big].$$
Since the expectation of the posterior is the prior,
$$F^\pi_b(a_N) \geq \mathbb{E}_{z_M}[V^\pi_{t-1}(b^{a_M, z_M}) \mid b, a_M],$$
$$F^\pi_b(a_N) \geq F^\pi_b(a_M). \qquad (63)$$
Consequently, we have:
$$\rho(b) + \gamma^{t-k} F^\pi_b(a_N) \geq \rho(b) + \gamma^{t-k} F^\pi_b(a_M),$$
$$Q^\pi_t(b, a_N) \geq Q^\pi_t(b, a_M). \qquad (64)$$

Theorem 6 If $z_{t:k}$ is conditionally independent given $s_k$, $V^\pi_t$ is convex over the belief space for all $t, \pi$, and $\rho(b) = -H_b(s) + \log(|S|)$, then for all $b$,
$$V^G_t(b) \geq (1 - e^{-1})^t\, V^*_t(b). \qquad (65)$$

Proof
Follows from Theorem 4, given that $Q^G_t(b, a)$ is non-negative, monotone and submodular. For $\rho(b) = -H_b(s) + \log(|S|)$, it is easy to see that $Q^G_t(b, a)$ is non-negative, as entropy is always positive (Cover and Thomas, 1991) and is maximal when $b(s) = 1/|S|$ for all $s$ (Cover and Thomas, 1991). Theorem 5 showed that $Q^G_t(b, a)$ is submodular if $\rho(b) = -H_b(s)$. The monotonicity of $Q^G_t$ follows from the condition that $V^\pi_t$ is convex in the belief space; Lemma 5 then shows that $Q^G_t(b, a)$ is monotone in $a$. ⊓⊔

Lemma 6
For all beliefs $b$, the error between $V^G_t(b)$ and $\tilde{V}^G_t(b)$ is bounded by $\frac{C\delta^\alpha}{1-\gamma}$. That is, $\|V^G_t - \tilde{V}^G_t\|_\infty \leq \frac{C\delta^\alpha}{1-\gamma}$.

Proof Follows exactly the strategy Araya-López et al (2010) used to prove (27), which places no conditions on $\pi$ and thus holds as long as $B^G$ is a contraction mapping. Since for any policy the Bellman operator $B^\pi$, defined as
$$(B^\pi V_{t-1})(b) = \rho(b, a_\pi) + \gamma \sum_{z \in \Omega} \Pr(z \mid a_\pi, b)\, V_{t-1}(b^{a_\pi, z}),$$
is a contraction mapping (Bertsekas, 2007), the bound holds for $\tilde{V}^G_t$. ⊓⊔

Let $\eta = \frac{C\delta^\alpha}{1-\gamma}$ and let $\tilde{Q}^*_t(b, a) = \tilde{\rho}(b) + \sum_z \Pr(z \mid b, a)\, \tilde{V}^*_{t-1}(b^{a, z})$ denote the value of taking action $a$ in belief $b$ under an optimal policy. Let $\tilde{Q}^G_t(b, a) = \tilde{\rho}(b) + \sum_z \Pr(z \mid b, a)\, \tilde{V}^G_{t-1}(b^{a, z})$ be the action-value function computed by greedy PBVI with immediate reward $\tilde{\rho}(b)$. Also, let
$$\tilde{Q}^\pi_t(b, a) = \tilde{\rho}(b) + \sum_z \Pr(z \mid b, a)\, \tilde{V}^\pi_{t-1}(b^{a, z}), \qquad \tilde{V}^\pi_t(b) = \tilde{\rho}(b) + \sum_z \Pr(z \mid b, a_\pi)\, \tilde{V}^\pi_{t-1}(b^{a_\pi, z}), \qquad (66)$$
denote the action-value and value functions for a given policy $\pi$ when the belief-based reward is $\tilde{\rho}(b)$. As mentioned before, it is not guaranteed that $\tilde{Q}^G_t(b, a)$ is submodular. Instead, we show that it is $\epsilon$-submodular:

Definition 6
The set function $f(a)$ is $\epsilon$-submodular in $a$ if, for every $a_M \subseteq a_N \subseteq A^+$, $a_e \in A^+ \setminus a_N$ and $\epsilon \geq 0$,
$$f(a_e \cup a_M) - f(a_M) \geq f(a_e \cup a_N) - f(a_N) - \epsilon.$$

Lemma 7 If $\|V^\pi_{t-1} - \tilde{V}^\pi_{t-1}\|_\infty \leq \eta$ and $Q^\pi_t(b, a)$ is submodular in $a$, then $\tilde{Q}^\pi_t(b, a)$ is $\epsilon'$-submodular in $a$ for all $b$, where $\epsilon' = 4(\gamma + 1)\eta$.

Proof Since $\|V^\pi_{t-1} - \tilde{V}^\pi_{t-1}\|_\infty \leq \eta$, for all beliefs $b$,
$$V^\pi_{t-1}(b) - \tilde{V}^\pi_{t-1}(b) \leq \eta. \qquad (67)$$
For a given $a$, multiplying both sides by $\gamma \geq 0$, taking the expectation over $z$, and using $\rho(b) - \tilde{\rho}(b) \leq \eta$,
$$\rho(b) - \tilde{\rho}(b) + \gamma\, \mathbb{E}_{z \mid b, a} V^\pi_{t-1}(b') - \gamma\, \mathbb{E}_{z \mid b, a} \tilde{V}^\pi_{t-1}(b') \leq \gamma\eta + \eta.$$
Therefore, for all $b, a$,
$$Q^\pi_t(b, a) - \tilde{Q}^\pi_t(b, a) \leq (\gamma + 1)\eta. \qquad (68)$$
Now, since $Q^\pi_t(b, a)$ is submodular, it satisfies
$$Q^\pi_t(b, a_e \cup a_M) - Q^\pi_t(b, a_M) \geq Q^\pi_t(b, a_e \cup a_N) - Q^\pi_t(b, a_N), \qquad (69)$$
for every $a_M \subseteq a_N \subseteq A^+$, $a_e \in A^+ \setminus a_N$. For each action that appears in (69), that is, $\{a_e \cup a_M\}$, $a_M$, $\{a_e \cup a_N\}$ and $a_N$, the value computed by $\tilde{Q}^\pi_t$ for belief $b$ is an approximation of $Q^\pi_t$. Thus the inequality in (69) that holds for $Q^\pi_t$ may not hold for $\tilde{Q}^\pi_t$. In the worst case, for some combination of $b$, $\{a_e \cup a_M\}$, $a_M$, $\{a_e \cup a_N\}$, the values $\tilde{Q}^\pi_t(b, a_e \cup a_M)$ and $\tilde{Q}^\pi_t(b, a_N)$ underestimate $Q^\pi_t(b, a_e \cup a_M)$ and $Q^\pi_t(b, a_N)$ by $(\gamma + 1)\eta$ each, while $\tilde{Q}^\pi_t(b, a_M)$ and $\tilde{Q}^\pi_t(b, a_e \cup a_N)$ overestimate $Q^\pi_t(b, a_M)$ and $Q^\pi_t(b, a_e \cup a_N)$ by $(\gamma + 1)\eta$ each. This can be written formally as:
$$\tilde{Q}^\pi_t(b, a_e \cup a_M) - \tilde{Q}^\pi_t(b, a_M) \geq \tilde{Q}^\pi_t(b, a_e \cup a_N) - \tilde{Q}^\pi_t(b, a_N) - 4(\gamma + 1)\eta. \qquad ⊓⊔$$

Lemma 8 If $\tilde{Q}^\pi_t(b, a)$ is non-negative, monotone and $\epsilon$-submodular in $a$, then
$$\tilde{Q}^\pi_t(b, a^G) \geq (1 - e^{-1})\, \tilde{Q}^\pi_t(b, a^*) - \chi_K\, \epsilon, \qquad (70)$$
where $\chi_K = \sum_{p=0}^{K-1}(1 - K^{-1})^p$.

Proof Let $a^*$ be the optimal set of action features of size $K$, $a^* = \operatorname{argmax}_a \tilde{Q}^\pi_t(b, a)$, and let $a^l$ be the greedily selected set of size $l$, that is, $a^l$ = greedy-argmax$(\tilde{Q}^\pi_t(b, \cdot), A^+, l)$. Also, let $a^* = \{a^*_1 \dots a^*_K\}$ be the elements of the set $a^*$. Then, by monotonicity of $\tilde{Q}^\pi_t(b, a)$,
$$\tilde{Q}^\pi_t(b, a^*) \leq \tilde{Q}^\pi_t(b, a^* \cup a^l).$$
Re-writing as a telescoping sum,
$$= \tilde{Q}^\pi_t(b, a^l) + \sum_{j=1}^{K} \Delta\tilde{Q}_b(a^*_j \mid a^l \cup \{a^*_1 \dots a^*_{j-1}\}).$$
Using Lemma 7, since $Q$ is $\epsilon'$-submodular,
$$\leq \tilde{Q}^\pi_t(b, a^l) + \sum_{j=1}^{K} \Delta\tilde{Q}_b(a^*_j \mid a^l) + 4K\epsilon.$$
As $a^{l+1}$ is built greedily from $a^l$ in order to maximize $\Delta\tilde{Q}_b$,
$$\leq \tilde{Q}^\pi_t(b, a^l) + \sum_{j=1}^{K} \big(\tilde{Q}^\pi_t(b, a^{l+1}) - \tilde{Q}^\pi_t(b, a^l)\big) + 4K\epsilon.$$
As $|a^*| = K$,
$$= \tilde{Q}^\pi_t(b, a^l) + K\big(\tilde{Q}^\pi_t(b, a^{l+1}) - \tilde{Q}^\pi_t(b, a^l)\big) + 4K\epsilon.$$
Let $\delta_l := \tilde{Q}^\pi_t(b, a^*) - \tilde{Q}^\pi_t(b, a^l)$, which allows us to rewrite the above equation as $\delta_l \leq K(\delta_l - \delta_{l+1}) + 4K\epsilon$. Hence, $\delta_{l+1} \leq (1 - \frac{1}{K})\delta_l + 4\epsilon$. Using this relation recursively, we can write
$$\delta_K \leq \Big(1 - \frac{1}{K}\Big)^K \delta_0 + 4\sum_{p=0}^{K-1}\Big(1 - \frac{1}{K}\Big)^p \epsilon.$$
Also, $\delta_0 = \tilde{Q}^\pi_t(b, a^*) - \tilde{Q}^\pi_t(b, a^0)$, and using the inequality $1 - x \leq e^{-x}$, we can write
$$\delta_K \leq e^{-K/K}\, \tilde{Q}^\pi_t(b, a^*) + 4\sum_{p=0}^{K-1}(1 - K^{-1})^p\, \epsilon.$$
Substituting $\delta_K$ and rearranging terms (with $\chi_K = \sum_{p=0}^{K-1}(1 - \frac{1}{K})^p$):
$$\tilde{Q}^\pi_t(b, a^G) \geq (1 - e^{-1})\, \tilde{Q}^\pi_t(b, a^*) - \chi_K\, \epsilon. \qquad ⊓⊔$$

Theorem 7
For all beliefs, the error between $\tilde{V}^G_t(b)$ and $\tilde{V}^*_t(b)$ is bounded if $\rho(b) = -H_b(s)$, $V^\pi_t$ is convex in the belief space for all $\pi, t$, and $z_{t:k}$ is conditionally independent given $s_k$.

Proof Theorem 6 shows that, if $\rho(b) = -H_b(s)$ and $z_{t:k}$ is conditionally independent given $s_k$, then $Q^G_t(b, a)$ is submodular. Using Lemma 7, with $V^\pi_t = V^G_t$, $\tilde{V}^\pi_t = \tilde{V}^G_t$, $Q^\pi_t(b, a) = Q^G_t(b, a)$ and $\tilde{Q}^\pi_t(b, a) = \tilde{Q}^G_t(b, a)$, it is easy to see that $\tilde{Q}^G_t(b, a)$ is $\epsilon$-submodular. This satisfies one condition of Lemma 8. Given that $\tilde{V}^G_t(b)$ is convex, the monotonicity of $\tilde{Q}^G_t(b, a)$ follows from Lemma 5. Since $\tilde{\rho}(b)$ is non-negative, $\tilde{Q}^G_t(b, a)$ is non-negative too. Now we can apply Lemma 8 to show that the error generated by a one-time application of the greedy Bellman operator to $\tilde{V}^G_t(b)$, instead of the Bellman optimality operator, is bounded. It is thus easy to see that the error between $\tilde{V}^G_t(b)$, produced by multiple applications of the greedy Bellman operator, and $\tilde{V}^*_t(b)$ is bounded for all beliefs. ⊓⊔

Acknowledgements
We thank Henri Bouma and TNO forproviding us with the dataset used in our experiments. Wealso thank the STW User Committee for its advice regard-ing active perception for multi-camera tracking systems. Thisresearch is supported by the Dutch Technology FoundationSTW (project
References
Araya-López M, Thomas V, Buffet O, Charpillet F (2010) A POMDP extension with belief-dependent rewards. In: Advances in Neural Information Processing Systems, pp 64–72
Åström KJ (1965) Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications pp 174–205
Bajcsy R (1988) Active perception. Proceedings of the IEEE 76(8):966–1005
Bertsekas DP (2007) Dynamic Programming and Optimal Control, vol II, 3rd edn. Athena Scientific
Bonet B, Geffner H (2009) Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, IJCAI'09, pp 1641–1646
Bouma H, Baan J, Landsmeer S, Kruszynski C, van Antwerpen G, Dijk J (2013) Real-time tracking and fast retrieval of persons in multiple surveillance cameras of a shopping mall. In: SPIE Defense, Security, and Sensing
Burgard W, Fox D, Thrun S (1997) Active mobile robot localization by entropy minimization. In: Proceedings of the Second EUROMICRO Workshop on Advanced Mobile Robots 1997, IEEE, pp 155–162
Chen Y, Javdani S, Karbasi A, Bagnell JA, Srinivasa S, Krause A (2015) Submodular surrogates for value of information. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp 3511–3518
Cheng HT (1988) Algorithms for partially observable Markov decision processes. PhD thesis, University of British Columbia
Cover TM, Thomas JA (1991) Entropy, relative entropy and mutual information. In: Elements of Information Theory, Wiley, pp 12–49
Dollar P, Belongie S, Perona P (2010) The fastest pedestrian detector in the west. In: Proceedings of the British Machine Vision Conference, BMVA Press, pp 68.1–68.11
Eck A, Soh LK (2012) Evaluating POMDP rewards for active perception. In: Proceedings of the Eleventh International Conference on Autonomous Agents and Multiagent Systems, pp 1221–1222
Fisher ML, Nemhauser GL, Wolsey LA (1978) An analysis of approximations for maximizing submodular set functions—II. Springer
Gilbarg D, Trudinger N (2001) Elliptic Partial Differential Equations of Second Order. U.S. Government Printing Office
Golovin D, Krause A (2011) Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research (JAIR) pp 427–486
Hauskrecht M (2000) Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research pp 33–94
Ji S, Parr R, Carin L (2007) Nonmyopic multiaspect sensing with partially observable Markov decision processes. IEEE Transactions on Signal Processing pp 2720–2730
Joshi S, Boyd S (2009) Sensor selection via convex optimization. IEEE Transactions on Signal Processing pp 451–462
Kaelbling LP, Littman ML, Cassandra AR (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence pp 99–134
Kochenderfer MJ (2015) Decision Making Under Uncertainty: Theory and Application. MIT Press
Krause A, Golovin D (2014) Submodular function max-imization. In: Tractability: Practical Approaches toHard Problems, Cambridge University PressKrause A, Guestrin C (2005) Optimal nonmyopic valueof information in graphical models - efficient algo-rithms and theoretical limits. In: Proceedings of theNineteenth International Joint Conference on Artifi-cial Intelligence, pp 1339–1345Krause A, Guestrin C (2007) Near-optimal observationselection using submodular functions. In: Proceed-ings of the Twenty-Second AAAI Conference on Ar-tificial Intelligence, pp 481–492Krause A, Guestrin C (2009) Optimal value of informa-tion in graphical models. Journal of Artificial Intelli-gence Research pp 557–591Kreucher C, Kastella K, Hero AO III (2005) Sensormanagement using an active sensing approach. SignalProcessing pp 607–624Krishnamurthy V, Djonin DV (2007) Structuredthreshold policies for dynamic sensor scheduling—apartially observed Markov decision process approach.IEEE Transactions on Signal ProcessingKumar A, Zilberstein S (2009) Event-detecting multi-agent MDPs: Complexity and constant-factor ap-proximation. In: Proceedings of the Twenty-FirstInternational Joint Conference on Artificial Intelli-gence, pp 201–207Kurniawati H, Hsu D, Lee WS (2008) Sarsop: Efficientpoint-based pomdp planning by approximating op-timally reachable belief spaces. In: In ProceedingsRobotics: Science and SystemsKurniawati H, Du Y, Hsu D, Lee WS (2010) Mo-tion planning under uncertainty for robotic taskswith long time horizons. The International Journalof Robotics ResearchLittman ML (1996) Algorithms for sequential decisionmaking. PhD thesis, Brown UniversityLovejoy WS (1991) Computationally feasible boundsfor partially observed Markov decision processes. Op-erations Research pp 162–175Monahan GE (1982) A survey of partially observableMarkov decision processes: Theory, models, and al-gorithms. Management Science pp 1–16Natarajan P, Hoang TN, Low KH, Kankanhalli M(2012) Decision-theoretic approach to maximizingobservation of multiple targets in multi-camerasurveillance. In: Proceedings of the 11th InternationalConference on Autonomous Agents and MultiagentSystems, pp 155–162Nemhauser G, Wolsey L, Fisher M (1978) An analysis ofapproximations for maximizing submodular set func-tions—i. Mathematical Programming pp 265–294 Oliehoek FA, Whiteson S, Spaan MTJ (2013) Approxi-mate solutions for factored Dec-POMDPs with manyagents. In: Proceedings of the Twelfth InternationalJoint Conference on Autonomous Agents and Multi-agent Systems, pp 563–570Pineau J, Gordon GJ (2007) Pomdp planning for robustrobot control. In: Robotics Research, Springer, pp69–82Pineau J, Gordon GJ, Thrun S (2006) Anytime point-based approximations for large POMDPs. Journal ofArtificial Intelligence Research pp 335–380Poupart P (2005) Exploiting structure to efficientlysolve large scale partially observable markov decisionprocesses. PhD thesis, University of TorontoRaphael C, Shani G (2012) The skyline algorithm forpomdp value function pruning. Annals of Mathemat-ics and Artificial Intelligence 65(1):61–77Ross S, Pineau J, Paquet S, Chaib-Draa B (2008) On-line planning algorithms for pomdps. Journal of Ar-tificial Intelligence Research pp 663–704Satsangi Y, Whiteson S, Oliehoek F (2015) Exploitingsubmodular value functions for faster dynamic sensorselection. In: AAAI 2015: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp3356–3363Shani G, Pineau J, Kaplow R (2012) A survey of point-based pomdp solvers. 
Autonomous Agents and Multi-Agent Systems pp 1–51
Silver D, Veness J (2010) Monte-Carlo planning in large POMDPs. In: Advances in Neural Information Processing Systems, pp 2164–2172
Smallwood RD, Sondik EJ (1973) The optimal control of partially observable Markov processes over a finite horizon. Operations Research pp 1071–1088
Sondik EJ (1971) The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, United States – California
Spaan MTJ (2008) Cooperative active perception using POMDPs. In: AAAI Conference on Artificial Intelligence 2008: Workshop on Advancements in POMDP Solvers
Spaan MTJ (2012) Partially observable Markov decision processes. In: Wiering M, van Otterlo M (eds) Reinforcement Learning: State of the Art, Springer Verlag, pp 387–414
Spaan MTJ, Lima PU (2009) A decision-theoretic approach to dynamic sensor selection in camera networks. In: International Conference on Automated Planning and Scheduling, pp 279–304
Spaan MTJ, Vlassis N (2005) Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24:195–220