POMDP Structural Results for Controlled Sensing
Vikram Krishnamurthy
I. INTRODUCTION

Structural results for POMDPs are important since solving POMDPs numerically is typically intractable: solving a classical POMDP is PSPACE-complete [40]. Moreover, in controlled sensing problems [16], [26], [10], it is often necessary to use POMDPs whose costs are nonlinear in the belief state in order to model the uncertainty in the state estimate. (For example, the variance of the state estimate is a quadratic function of the belief.) In such cases, there is no finite dimensional characterization of the optimal POMDP policy even for a finite horizon. The seminal papers [35], [43], [44] give sufficient conditions on the costs, transition probabilities and observation probabilities so that the value function of a POMDP is monotone with respect to the monotone likelihood ratio (MLR) order (and more generally the multivariate TP2 order). These papers then use this monotonicity result to show that the optimal policy can be lower bounded by a myopic policy. Our recent works [20], [28] relax the conditions on the transition matrix to construct myopic lower and upper bounds.

II. THE PARTIALLY OBSERVED MARKOV DECISION PROCESS
For notational convenience, we consider a discrete time, infinite horizon discounted cost POMDP. A discrete time Markov chain evolves on the state space X = {1, 2, ..., X}. Denote the action space as U = {1, 2, ..., U} and the observation space as Y: for discrete-valued observations, Y = {1, 2, ..., Y}, and for continuous observations, Y ⊂ IR. Let Π(X) = {π : π(i) ∈ [0, 1], Σ_{i=1}^X π(i) = 1} denote the belief space of X-dimensional probability vectors. For a stationary policy µ : Π(X) → U, initial belief π ∈ Π(X) and discount factor ρ ∈ [0, 1), define the discounted cost

J_µ(π) = E_µ { Σ_{k=0}^∞ ρ^k c'_{µ(π_k)} π_k }. (1)

Here c_u = [c(1, u), ..., c(X, u)]', u ∈ U, is the cost vector for each action, and the belief state evolves as π_k = T(π_{k−1}, y_k, u_k) where

T(π, y, u) = B_y(u) P'(u) π / σ(π, y, u), σ(π, y, u) = 1'_X B_y(u) P'(u) π, B_y(u) = diag{B_{1,y}(u), ..., B_{X,y}(u)}. (2)

V. Krishnamurthy is with the Department of Electrical and Computer Engineering, Cornell University, USA (email: [email protected]).
January 3, 2017 DRAFT
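As a concrete illustration, the HMM filter update (2) is a one-line computation. The following sketch is not from the text; the matrices P and B are hypothetical example values.

```python
import numpy as np

def belief_update(pi, y, P, B):
    """HMM filter update T(pi, y, u) of (2), for a fixed action's P(u), B(u)."""
    unnorm = B[:, y] * (P.T @ pi)     # B_y(u) P'(u) pi
    sigma = unnorm.sum()              # sigma(pi, y, u) = 1_X' B_y(u) P'(u) pi
    return unnorm / sigma, sigma

# hypothetical 2-state example
P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
pi1, sigma = belief_update(pi0, 0, P, B)
```

The returned σ(π, y, u) is exactly the normalization term that reappears as the observation likelihood in Bellman's equation below.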
Here 1_X represents an X-dimensional vector of ones, P(u) = [P_ij(u)]_{X×X} with P_ij(u) = P(x_{k+1} = j | x_k = i, u_k = u) denotes the transition probabilities, and B_xy(u) = P(y_{k+1} = y | x_{k+1} = x, u_k = u) when Y is finite, or B_xy(u) is the conditional probability density function when Y ⊂ IR.
The aim is to compute the optimal stationary policy µ* : Π(X) → U such that J_{µ*}(π) ≤ J_µ(π) for all π ∈ Π(X). Obtaining the optimal policy µ* is equivalent to solving Bellman's dynamic programming equation: µ*(π) = argmin_{u ∈ U} Q(π, u), J_{µ*}(π) = V(π), where

V(π) = min_{u ∈ U} Q(π, u), Q(π, u) = c'_u π + ρ Σ_{y ∈ Y} V(T(π, y, u)) σ(π, y, u). (3)

Since Π(X) is a continuum, Bellman's equation (3) does not translate into practical solution methodologies, as the value function V(π) needs to be evaluated at each π ∈ Π(X).

A. POMDPs in Controlled Sensing
In controlled sensing, to incorporate uncertainty of the state estimate, we generalize the above POMDP to consider costs that are nonlinear in the belief. Consider the following instantaneous cost at each time k:

c(x_k, u_k) + d(x_k, π_k, u_k), u_k ∈ U = {1, 2, ..., U}.

(i) Sensor Usage Cost: c(x_k, u_k) denotes the instantaneous cost of using sensor u_k at time k when the Markov chain is in state x_k.
(ii) Sensor Performance Loss: d(x_k, π_k, u_k) models the performance loss when using sensor u_k. This loss is modeled as an explicit function of the belief state π_k to capture the uncertainty in the state estimate.
Typically there is a trade-off between the sensor usage cost and the performance loss: accurate sensors have high usage cost but small performance loss.
In terms of the belief state, the instantaneous cost can then be expressed as

C(π_k, u_k) = E{c(x_k, u_k) + d(x_k, π_k, u_k) | I_k} = c'_{u_k} π_k + D(π_k, u_k), where
D(π_k, u_k) = E{d(x_k, π_k, u_k) | I_k} = Σ_{i=1}^X d(i, π_k, u_k) π_k(i). (4)

Define the controlled sensing objective

J_µ(π) = E_µ { Σ_{k=0}^∞ ρ^k C(π_k, u_k) }. (5)

In controlled sensing, the aim is to compute the optimal stationary policy µ* : Π(X) → U such that J_{µ*}(π) ≤ J_µ(π) for all π ∈ Π(X). Obtaining the optimal controlled sensing policy µ* is equivalent to solving Bellman's dynamic programming equation: µ*(π) = argmin_{u ∈ U} Q(π, u), J_{µ*}(π) = V(π), where

V(π) = min_{u ∈ U} Q(π, u), Q(π, u) = C(π, u) + ρ Σ_{y ∈ Y} V(T(π, y, u)) σ(π, y, u). (6)
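The belief-dependent cost (4) is straightforward to evaluate from the belief vector alone. A minimal sketch, using a hypothetical quadratic performance loss (the choice M = I, α(u) = 1, β(u) = 0 is illustrative, not from the text):

```python
import numpy as np

def D(pi, u, d):
    """Expected performance loss D(pi, u) = sum_i d(i, pi, u) pi(i), as in (4)."""
    return sum(d(i, pi, u) * pi[i] for i in range(len(pi)))

def C(pi, u, c, d):
    """Instantaneous cost C(pi, u) = c_u' pi + D(pi, u)."""
    return c[u] @ pi + D(pi, u, d)

# hypothetical mean-square loss with M = I, alpha(u) = 1, beta(u) = 0
def d_ms(i, pi, u):
    e_i = np.eye(len(pi))[i]
    return (e_i - pi) @ (e_i - pi)

c = {1: np.array([1.0, 0.5, 0.2])}    # hypothetical usage-cost vector
pi = np.array([0.2, 0.5, 0.3])
```

For this choice of d, the expectation collapses to the closed form 1 − π'π, which is the M = I special case of the mean-square loss discussed in the next subsection.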
B. Examples of Nonlinear Cost POMDP
The non-standard feature of the objective (5) is the nonlinear performance loss term D(π, u). These costs should be chosen so that they are zero at the vertices e_i of the belief space Π(X) (reflecting perfect state estimation) and largest at the centroid of the belief space (most uncertain estimate). We now discuss examples of d(x, π, u) and its conditional expectation D(π, u).

(i). Piecewise Linear Cost: Here we choose the performance loss as

d(x, π, u) = { 0 if ||x − π||_∞ ≤ ε, ε if ε ≤ ||x − π||_∞ ≤ 1 − ε, 1 if ||x − π||_∞ ≥ 1 − ε }, ε ∈ [0, 1/2]. (7)

Then D(π, u) is piecewise linear and concave. This cost is useful for subjective decision making, e.g., the distance of a target to a radar is quantized into three regions: close, medium and far.

(ii). Mean Square, l_1 and l_∞ Performance Loss: Suppose in (5) we choose

d(x, π, u) = α(u)(x − π)' M (x − π) + β(u), x ∈ {e_1, ..., e_X}, π ∈ Π. (8)

Here M is a user defined positive semi-definite symmetric matrix, and α(u) and β(u), u ∈ U, are user defined positive scalar weights that allow different sensors (sensing modes) to be weighed differently. So (8) is the squared error of the Bayesian estimator (weighted by M, scaled by α(u) and translated by β(u)). In terms of the belief state, the mean square performance loss (8) is

D(π_k, u_k) = E{d(x_k, π_k, u_k) | I_k} = α(u_k) ( Σ_{i=1}^X M_ii π_k(i) − π'_k M π_k ) + β(u_k) (9)

because E{(x_k − π_k)' M (x_k − π_k) | I_k} = Σ_{i=1}^X (e_i − π)' M (e_i − π) π(i). The cost (9) is quadratic and concave in the belief.
Alternatively, if d(x, π, u) = ||x − π||_1 then D(π, u) = 2(1 − π'π), which is also quadratic in the belief. Also, choosing d(x, π, u) = ||x − π||_∞ yields D(π, u) = 1 − π'π.

(iii). Entropy based Performance Loss: Here we choose

D(π, u) = −α(u) Σ_{i=1}^X π(i) log π(i) + β(u), π ∈ Π.
(10)

The intuition is that an inaccurate sensor with cheap usage cost yields a Bayesian estimate π with a higher entropy compared to an accurate sensor.

III. STRUCTURAL RESULT: CONVEXITY OF VALUE FUNCTION AND STOPPING SET

Our first result is that the value function V(π) in (6) is concave in π ∈ Π(X). (Footnote: A linear function c'_u π cannot attain its maximum at the centroid of a simplex, since a linear function achieves its maximum at a boundary point.)
Theorem III.1.
Consider a POMDP with possibly continuous-valued observations. Assume that for each action u, the instantaneous cost C(π, u) is concave and continuous with respect to π ∈ Π(X). Then the value function V(π) is concave in π. The proof is given in [20, Chapter 8].
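Theorem III.1 can be checked numerically: running value iteration on a uniform belief grid for a small two-state POMDP yields grid values satisfying the discrete midpoint-concavity inequality. All parameter values in this sketch are hypothetical.

```python
import numpy as np

# Hypothetical 2-state, 2-action POMDP; the belief is summarised by p = pi(2).
P = {1: np.array([[0.9, 0.1], [0.2, 0.8]]),
     2: np.array([[0.8, 0.2], [0.3, 0.7]])}
B = {1: np.array([[0.6, 0.4], [0.4, 0.6]]),   # inaccurate sensor
     2: np.array([[0.9, 0.1], [0.1, 0.9]])}   # accurate sensor
c = {1: np.array([1.0, 0.5]), 2: np.array([2.0, 1.0])}
rho = 0.8

grid = np.linspace(0.0, 1.0, 201)             # uniform grid over pi(2)
V = np.zeros_like(grid)

def filter_step(pi, y, u):
    unnorm = B[u][:, y] * (P[u].T @ pi)       # B_y(u) P'(u) pi
    s = unnorm.sum()
    return unnorm / s, s

for _ in range(100):                          # value iteration for (3)
    Vnew = np.empty_like(V)
    for k, p in enumerate(grid):
        pi = np.array([1.0 - p, p])
        Q = []
        for u in (1, 2):
            q = c[u] @ pi
            for y in (0, 1):
                Tpi, s = filter_step(pi, y, u)
                q += rho * s * np.interp(Tpi[1], grid, V)
            Q.append(q)
        Vnew[k] = min(Q)
    V = Vnew
```

Each Bellman backup preserves concavity (the term σ·V(T(·)) is a perspective-type composition of a concave function with an affine map), so every iterate, and hence the final grid, is concave.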
A. Convexity of Stopping Set for Stopping Time POMDPs with Nonlinear Cost
With the above concavity result we have the following important result for controlled sensing stopping time POMDPs. A stopping time POMDP has action space U = {1 (stop), 2 (continue)}.
The stop action u = 1 incurs a terminal cost of c(x, u = 1) and the problem terminates.
For the continue action u = 2, the state x ∈ X = {1, 2, ..., X} evolves with transition matrix P and is observed via observations y with observation probabilities B_xy = P(y_k = y | x_k = x). An instantaneous cost c(x, u = 2) is incurred. Thus for u = 2, the belief state evolves according to the HMM filter π_k = T(π_{k−1}, y_k). Since action 1 is a stop action and has no dynamics, to simplify notation we write T(π, y, 2) as T(π, y) and σ(π, y, 2) as σ(π, y) in this subsection.
For the stopping time POMDP, µ* is the solution of Bellman's equation, which is of the form

µ*(π) = argmin_{u ∈ U} Q(π, u), V(π) = min_{u ∈ U} Q(π, u), (11)
Q(π, 1) = c'_1 π, Q(π, 2) = C(π, 2) + ρ Σ_{y ∈ Y} V(T(π, y)) σ(π, y),

where T(π, y) and σ(π, y) are the HMM filter and normalization (2).
We now present the first structural result for stopping time POMDPs: the stopping region for the optimal policy is convex. Define the stopping set R_1 as the set of belief states for which stopping (u = 1) is the optimal action, and R_2 as the set of belief states for which continuing (u = 2) is the optimal action. That is,

R_1 = {π : µ*(π) = 1 (stop)}, R_2 = {π : µ*(π) = 2} = Π(X) − R_1. (12)

The theorem below shows that the stopping set R_1 is convex (and therefore a connected set). Recall that the value function V(π) is concave on Π(X).

Theorem III.2.
Consider the stopping-time POMDP with value function given by (11). Suppose that the possibly nonlinear cost C(π, 2) is concave in π. Then the stopping set R_1 is a convex subset of the belief space Π(X).

Proof: Pick any two belief states π_1, π_2 ∈ R_1. To demonstrate convexity of R_1, we need to show for any λ ∈ [0, 1] that λπ_1 + (1 − λ)π_2 ∈ R_1. Since V(π) is concave,

V(λπ_1 + (1 − λ)π_2) ≥ λV(π_1) + (1 − λ)V(π_2)
= λQ(π_1, 1) + (1 − λ)Q(π_2, 1) (since π_1, π_2 ∈ R_1)
= Q(λπ_1 + (1 − λ)π_2, 1) (since Q(π, 1) is linear in π)
≥ V(λπ_1 + (1 − λ)π_2) (since V(π) is the optimal value function).

Thus all the inequalities above are equalities, and λπ_1 + (1 − λ)π_2 ∈ R_1.
The above theorem is a small extension of [34], which deals with the case when the costs C(π, u) are linear in π. The proof is exactly the same as in [34]: all that is required is that C(π, 2) is concave.

B. Example. Quickest Change Detection with Nonlinear Delay Cost
Quickest detection is a useful example of a stopping time POMDP that has applications in numerous areas [42], [2]. The classical Bayesian quickest detection problem is as follows. An underlying discrete-time state process x jump changes at a geometrically distributed random time τ⁰. Consider a sequence of random measurements {y_k, k ≥ 1} such that, conditioned on the event {τ⁰ = t}, the observations y_k, k ≤ t, are i.i.d. random variables with distribution B_2y and y_k, k > t, are i.i.d. random variables with distribution B_1y. The quickest detection problem involves detecting the change time τ⁰ with minimal cost. That is, at each time k = 1, 2, ..., a decision u_k ∈ {continue, stop and announce change} needs to be made to optimize a tradeoff between false alarm frequency and linear delay penalty.
A geometrically distributed change time τ⁰ is realized by a two state (X = 2) Markov chain with absorbing transition matrix P and prior π_0 as follows:

P = [ 1, 0 ; 1 − P_22, P_22 ], π_0 = [0, 1]', τ⁰ = inf{k : x_k = 1}. (13)

The system starts in state 2 and then jumps to the absorbing state 1 at time τ⁰. Clearly τ⁰ is geometrically distributed with mean 1/(1 − P_22).
The cost criterion in classical quickest detection is the Kolmogorov–Shiryayev criterion for detection of disorder [46]:

J_µ(π) = d E_µ{(τ − τ⁰)^+} + P_µ(τ < τ⁰), π_0 = π, (14)

where µ denotes the decision policy. The first term is the delay penalty in making a decision at time τ > τ⁰, and d is a positive real number. The second term is the false alarm penalty incurred in announcing a change at time τ < τ⁰.

(Footnote: There are two general formulations for quickest time detection. In the first formulation, the change point τ⁰ is an unknown deterministic time, and the goal is to determine a stopping rule such that a worst case delay penalty is minimized subject to a constraint on the false alarm frequency (see, e.g., [37], [41], [51], [42]). The second formulation, which is the formulation considered in this book (this chapter and also Chapter ??), is the Bayesian approach, where the change time τ⁰ is specified by a prior distribution.)
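The geometric change time realized by (13) is easily verified by simulation; the value of P_22 below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
P22 = 0.95                        # hypothetical probability of staying in state 2

def change_time():
    """Simulate (13): start in state 2; jump to absorbing state 1 w.p. 1 - P22."""
    k = 1
    while rng.random() < P22:     # remain in the pre-change state
        k += 1
    return k                      # tau0 = inf{k : x_k = 1}

samples = [change_time() for _ in range(100_000)]
# the sample mean should be close to 1 / (1 - P22) = 20
```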
Stopping time POMDP: The quickest detection problem with penalty (14) is a stopping time POMDP with U = {1 (announce change and stop), 2 (continue)}, X = {1, 2}, transition matrix in (13), arbitrary observation probabilities B_xy, cost vectors c_1 = [0, 1]', c_2 = [d, 0]', and discount factor ρ = 1.
In light of Theorem III.2, we can generalize this to delay costs C(π, 2) that are nonlinear and concave in the belief. For example, such a cost could be motivated by the squared error or entropy of the belief, reflecting an inaccurate state estimate. We have the following structural result.

Corollary III.3.
The optimal policy µ* for classical quickest detection has a threshold structure: there exists a threshold point π* ∈ [0, 1] such that

u_k = µ*(π_k) = { 2 (continue) if π_k(2) ∈ [π*, 1], 1 (stop and announce change) if π_k(2) ∈ [0, π*). } (15)

Proof:
Since X = 2, Π(X) is the interval [0, 1], and π(2) ∈ [0, 1] is the belief state. Theorem III.2 implies that the stopping set R_1 is convex. In one dimension this implies that R_1 is an interval of the form [a*, π*) for 0 ≤ a* < π* ≤ 1. Since state 1 is absorbing, Bellman's equation (11) with ρ = 1 applied at π = e_1 implies µ*(e_1) = argmin_u { c(1, u = 1), d(1 − π(2)) + V(e_1) } = 1, since c(1, u = 1) = 0. So e_1, or equivalently π(2) = 0, belongs to R_1. Therefore R_1 is an interval of the form [0, π*). Hence the optimal policy is of the form (15).
Theorem III.2 says that for quickest detection of a multi-state Markov chain, the stopping set R_1 is convex for any concave nonlinear delay cost. This is different to the result in [17], which considered a nonlinear stopping cost (false alarm cost): in [17] the stopping set was not necessarily convex. For additional results on controlled sampling with quickest detection see [19].

Social Learning
Social learning, or learning from the actions of others, is an integral part of human behavior and has been studied widely in behavioral economics, sociology, electrical engineering and computer science to model the interaction of decision makers [3], [1], [6], [9], [45], [50], [18], [29], [30]. POMDPs with social learning result in interesting behaviour.
Social learning models present unique challenges from a statistical signal processing point of view. First, agents interact with and influence each other. For example, ratings posted on online reputation systems strongly influence the behavior of individuals. This is usually not the case with physical sensors. Second, agents (humans) lack the capability to quickly absorb information and translate it into decisions. According to the paradigm of rational inattention theory, pioneered by economics Nobel prize winner Sims [47], attention is a time-limited resource that can be modelled in terms of an information-theoretic channel capacity. Therefore, while apparently mistaken
decisions are ubiquitous, this does not imply that decision makers are irrational. For recent results on quickest detection POMDPs with social learning and risk averse agents, see [18], [22].
Remark: One of the best known examples of a stopping time problem is optimal search for a Markov target [8], [36], [48], [15]. Another interesting example is a multiple stopping problem [38], [21]; this has applications in interactive advertising in social multimedia such as YouTube. The problem has distinct parallels to scheduling in communication systems [39].

IV. THE VALUE FUNCTION IS POSITIVELY HOMOGENEOUS
Define the positive X-orthant as IR^X_+. On this positive orthant, define the relaxed belief state α. We can define the following Bellman's equation, where W below denotes the value function with α ∈ IR^X_+:

W(α) = min_{u ∈ U} Q(α, u), Q(α, u) = c'_u α + ρ Σ_{y ∈ Y} W(T(α, y, u)) σ(α, y, u). (16)

Clearly when α is restricted to the belief space (unit simplex) Π(X), then W(α) = V(α). This can be established by mathematical induction (value iteration) and the proof is omitted. We now have the following result.
The relaxed value function W(·) of a linear cost POMDP is positively homogeneous. That is, for any constant κ > 0, W(κα) = κW(α). Therefore, (16) can be expressed as

W(α) = min_{u ∈ U} Q(α, u), Q(α, u) = c'_u α + ρ Σ_{y ∈ Y} W(B_y(u) P'(u) α). (17)

The proof is straightforward since the cost c'_u α and σ(α, y, u) are linear in α and T(κα, y, u) = T(α, y, u). It is this positive homogeneity property of the value function, and especially the representation (17), which in the finite horizon case allows one to immediately show that the value function is piecewise linear and concave.

V. MONOTONE VALUE FUNCTION
Definition V.1 (Monotone Likelihood Ratio (MLR) order ≥_r). Let π_1, π_2 ∈ Π(X) denote two beliefs. Then π_1 dominates π_2 with respect to the MLR order, denoted as π_1 ≥_r π_2, if π_1(i) π_2(j) ≤ π_2(i) π_1(j) for i < j, i, j ∈ {1, ..., X}. A function φ : Π(X) → IR is said to be MLR increasing if π_1 ≥_r π_2 implies φ(π_1) ≥ φ(π_2).

(A1) C(π, u) is first order stochastically decreasing in π for each u ∈ U.
(A2) P(u), u ∈ U, is totally positive of order 2 (TP2): all second-order minors are nonnegative.
(A3) B(u), u ∈ U, is totally positive of order 2 (TP2).

Theorem V.2.
Under A1, A2 and A3, the value function V(π) in (6) is MLR decreasing. The proof of the theorem is in [20], [25].

(Footnote: Limits on attention impact choice. For example, purchasers limit their attention to a relatively small number of websites when buying over the internet; shoppers buy expensive products due to their failure to notice if sales tax is included in the price [5].)
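Assumptions A2 and A3 are easy to check numerically. A sketch of the TP2 test and the MLR comparison of Definition V.1 follows; the example matrices in the test are hypothetical.

```python
import numpy as np
from itertools import combinations

def is_tp2(M, tol=1e-12):
    """Check A2/A3: every 2x2 minor of M (rows i1<i2, cols j1<j2) is nonnegative."""
    for i1, i2 in combinations(range(M.shape[0]), 2):
        for j1, j2 in combinations(range(M.shape[1]), 2):
            if M[i1, j1] * M[i2, j2] - M[i1, j2] * M[i2, j1] < -tol:
                return False
    return True

def mlr_geq(pi1, pi2, tol=1e-12):
    """pi1 >=_r pi2 in Definition V.1: pi1(i) pi2(j) <= pi2(i) pi1(j) for i < j."""
    X = len(pi1)
    return all(pi1[i] * pi2[j] <= pi2[i] * pi1[j] + tol
               for i in range(X) for j in range(i + 1, X))
```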
Note that for 2 states (X = 2), one can always permute the observation labels so that A3 holds. Moreover, A2 then becomes the same as the first row being first order stochastically dominated by the second row. Therefore for X = 2, the conditions for a monotone value function for a POMDP are identical to those for a fully observed MDP. Based on extensive numerical experiments, we conjecture that assumption A3 is not required for Theorem V.2.

Conjecture V.1.
Under A1 and A2, the value function V(π) in (6) is MLR decreasing.

This conjecture implies that monotone value functions for POMDPs require very similar conditions to monotone value functions for fully observed MDPs. Of course, the TP2 condition A2 on the transition matrix is stronger than the first order dominance conditions on the transition matrix used for fully observed MDPs.
Finally, we mention that one can also show that the value function involving controlled sensing with a Kalman filter is monotone [24]. In this case, the covariance matrices of the Kalman filters are partially ordered with respect to positive definiteness. Results for monotone HMM filters are given in [31]. These monotone results can also be used for POMDP bandits, as discussed in [32]. One can also consider controlled sampling of an evolving duplication-deletion graph; the dynamics of the belief are given by the HMM filter, as described in [23].

VI. BLACKWELL DOMINANCE AND OPTIMALITY OF MYOPIC POLICIES
A. Myopic Policy Bound to Optimal Decision Policy
Motivated by active sensing applications, consider the following POMDP where, based on the current belief state π_{k−1}, agent k chooses sensing mode u_k ∈ {1 (low resolution sensor), 2 (high resolution sensor)}. The assumption that mode u = 2 yields more accurate observations than mode u = 1 is modeled as follows. We say mode 2 Blackwell dominates mode 1, denoted as B(2) ⪰_B B(1), if

B(1) = B(2) R. (18)

Here R is a Y(2) × Y(1) stochastic matrix. R can be viewed as a confusion matrix that maps Y(2) probabilistically to Y(1). (In a communications context, one can view R as a noisy discrete memoryless channel with input y(2) and output y(1).) Intuitively, (18) means that B(2) is more accurate than B(1).
The goal is to compute the optimal policy µ*(π) ∈ {1, 2} to minimize the expected cumulative cost incurred by all the agents:

J_µ(π) = E_µ { Σ_{k=0}^∞ ρ^k C(π_k, u_k) }, (19)

where ρ ∈ [0, 1) is the discount factor. Even though solving the above POMDP is computationally intractable in general, using Blackwell dominance we show below that a myopic policy forms a lower bound for the optimal policy.
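The dominance condition (18) is constructive: any stochastic B(2) and confusion matrix R yield a dominated B(1) = B(2)R, and for a concave function of the belief the averaged filtered values are then ordered, which is the Jensen's-inequality step used in the proof of Theorem VI.1. A numerical sketch with hypothetical matrices:

```python
import numpy as np

P = np.array([[0.9, 0.1], [0.2, 0.8]])                # hypothetical transition matrix
B2 = np.array([[0.85, 0.10, 0.05],
               [0.05, 0.10, 0.85]])                   # accurate sensor, Y(2) = 3
R = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])                            # stochastic confusion matrix, Y(2) x Y(1)
B1 = B2 @ R                                           # B(1) = B(2) R: mode 2 dominates mode 1

def V_test(pi):
    return -pi @ pi        # any concave test function stands in for the value function

def expected_V(pi, Bu):
    """sum_y V(T(pi, y)) sigma(pi, y) for observation matrix Bu."""
    total = 0.0
    for y in range(Bu.shape[1]):
        unnorm = Bu[:, y] * (P.T @ pi)                # B_y P' pi
        s = unnorm.sum()
        total += V_test(unnorm / s) * s
    return total
```

For every belief, the less informative sensor B(1) yields a larger expectation of the concave test function, matching the direction of the inequality in the proof.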
The value function V(π) and optimal policy µ*(π) satisfy Bellman's equation

V(π) = min_{u ∈ U} Q(π, u), µ*(π) = argmin_{u ∈ U} Q(π, u), J_{µ*}(π) = V(π),
Q(π, u) = C(π, u) + ρ Σ_{y(u) ∈ Y(u)} V(T(π, y, u)) σ(π, y, u),
T(π, y, u) = B_{y(u)}(u) P' π / σ(π, y, u), σ(π, y, u) = 1'_X B_{y(u)}(u) P' π. (20)

We now present the structural result. Let Π_s ⊂ Π denote the set of belief states for which C(π, 2) ≤ C(π, 1). Define the myopic policy

µ̄(π) = { 2 if π ∈ Π_s, 1 otherwise. }

Theorem VI.1.
Assume that C(π, u) is concave with respect to π ∈ Π(X) for each action u. Suppose B(2) ⪰_B B(1), i.e., B(1) = B(2) R holds where R is a stochastic matrix. Then the myopic policy µ̄(π) is a lower bound to the optimal policy µ*(π), i.e., µ*(π) ≥ µ̄(π) for all π ∈ Π. In particular, for π ∈ Π_s, µ*(π) = µ̄(π) = 2, i.e., it is optimal to choose action 2 when the belief is in Π_s.
Remark: If B(1) ⪰_B B(2), then the myopic policy constitutes an upper bound to the optimal policy.
Theorem VI.1 is proved below. The proof exploits the fact that the value function is concave and uses Jensen's inequality. The usefulness of Theorem VI.1 stems from the fact that µ̄(π) is trivial to compute: it forms a provable lower bound to the computationally intractable optimal policy µ*(π). Since µ̄ is sub-optimal, it incurs a higher cumulative cost. This cumulative cost can be evaluated via simulation and is an upper bound to the achievable optimal cost.
Theorem VI.1 is non-trivial: the instantaneous costs satisfying C(π, 2) ≤ C(π, 1) do not trivially imply that the myopic policy µ̄(π) coincides with the optimal policy µ*(π), since the optimal policy applies to a cumulative cost function involving an infinite horizon trajectory of the dynamical system.

B. Example 1. Optimal Filter vs Predictor Scheduling
Suppose u = 2 is an active sensor (filter) which obtains measurements of the underlying Markov chain and uses the optimal HMM filter on these measurements to compute the belief and therefore the state estimate. So the usage cost of sensor 2 is high (since obtaining observations is expensive and can also result in increased threat of being discovered), but its performance loss is low (performance quality is high).
Suppose sensor u = 1 is a predictor which needs no measurement. So its usage cost is low (no measurement is required). However its performance loss is high, since it is more inaccurate compared to sensor 2.
Since the predictor has non-informative observation probabilities, its observation probability matrix is B(1) = (1/Y) 1_{X×Y}. So clearly B(1) = B(2) R with R = (1/Y) 1_{Y(2)×Y}, since each row of B(2) sums to one; that is, the filter (sensor 2) Blackwell dominates the predictor (sensor 1). Theorem VI.1 then says that if the current belief is π_k and C(π_k, 2) ≤ C(π_k, 1), it is always optimal to deploy the filter (sensor 2).
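The claim that the filter Blackwell dominates the predictor can be verified in a few lines; the filter matrix B(2) below is a hypothetical example.

```python
import numpy as np

X, Y2, Y1 = 2, 3, 3
B2 = np.array([[0.8, 0.15, 0.05],
               [0.1, 0.2, 0.7]])       # hypothetical filter sensor, X x Y(2)
B1 = np.full((X, Y1), 1.0 / Y1)        # non-informative predictor
R = np.full((Y2, Y1), 1.0 / Y1)        # confusion matrix that forgets y(2) entirely
```

Because each row of B(2) sums to one, B(2)R has every entry equal to 1/Y, which is exactly the predictor's observation matrix.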
C. Example 2. Ultrametric Matrices and Blackwell Dominance

An X × X square matrix B is a symmetric stochastic ultrametric matrix if
1) B is symmetric and stochastic;
2) B_ij ≥ min{B_ik, B_kj} for all i, j, k ∈ {1, 2, ..., X};
3) B_ii > max{B_ik}, k ∈ {1, 2, ..., X} − {i} (diagonally dominant).
It is shown in [12] that if B is a symmetric stochastic ultrametric matrix, then the U-th root, namely B^{1/U}, is also a stochastic matrix for any positive integer U. Then with ⪰_B denoting Blackwell dominance (18), clearly

B^{1/U} ⪰_B B^{2/U} ⪰_B · · · ⪰_B B^{(U−1)/U} ⪰_B B.

Consider a social network where the reputations of agents are denoted as u ∈ {1, 2, ..., U}. An agent with reputation u has observation probability matrix B^{(U−u+1)/U}. So an agent with reputation 1 (lowest reputation) is U degrees of separation from the source signal, while an agent with reputation U (highest reputation) is 1 degree of separation from the source signal. The underlying source (state) could be a news event, sentiment or corporate policy that evolves with time. A marketing agency can sample these agents: it can sample high reputation agents that have accurate observations, but this costs more than sampling low reputation agents that have less accurate observations. Then Theorem VI.1 gives a suboptimal policy that provably lower bounds the optimal sampling policy.

D. Proof of Theorem VI.1
Recall from Theorem III.1 that C(π, u) concave implies that V(π) is concave on Π(X). We then use the Blackwell dominance condition (18). In particular,

T(π, y(1), 1) = Σ_{y(2) ∈ Y(2)} T(π, y(2), 2) [σ(π, y(2), 2) / σ(π, y(1), 1)] P(y(1) | y(2)),
σ(π, y(1), 1) = Σ_{y(2) ∈ Y(2)} σ(π, y(2), 2) P(y(1) | y(2)).

Therefore σ(π, y(2), 2) P(y(1) | y(2)) / σ(π, y(1), 1) is a probability measure w.r.t. y(2) (since the denominator is the sum of the numerator over all y(2)). Since V(·) is concave, using Jensen's inequality it follows that

V(T(π, y(1), 1)) = V( Σ_{y(2) ∈ Y(2)} T(π, y(2), 2) [σ(π, y(2), 2) P(y(1) | y(2)) / σ(π, y(1), 1)] )
≥ Σ_{y(2) ∈ Y(2)} V(T(π, y(2), 2)) [σ(π, y(2), 2) P(y(1) | y(2)) / σ(π, y(1), 1)]
⟹ Σ_{y(1)} V(T(π, y(1), 1)) σ(π, y(1), 1) ≥ Σ_{y(2)} V(T(π, y(2), 2)) σ(π, y(2), 2). (21)

Therefore, for π ∈ Π_s,

C(π, 2) + ρ Σ_{y(2)} V(T(π, y(2), 2)) σ(π, y(2), 2) ≤ C(π, 1) + ρ Σ_{y(1)} V(T(π, y(1), 1)) σ(π, y(1), 1).

So for π ∈ Π_s, the optimal policy µ*(π) = argmin_{u ∈ U} Q(π, u) = 2. Thus µ̄(π) = µ*(π) = 2 for π ∈ Π_s and µ̄(π) = 1 otherwise, implying that µ̄(π) is a lower bound for µ*(π).
The above result is quite general and can be extended to controlled sensing of jump Markov linear systems [7], [33], [10].

(Footnote: Although we do not pursue it here, conditions that ensure that the U-th root of a transition matrix is a valid stochastic matrix are important in interpolating Markov chains. For example, transition matrices for credit ratings on a yearly time scale can be obtained from rating agencies such as Standard & Poor's. Determining the transition matrix for periods of six months involves the square root of the yearly transition matrix [12].)

VII. INVERSE POMDPS AND REVEALED PREFERENCES
How can one develop data-centric non-parametric methods (algorithms and associated mathematical analysis) to identify utility functions of agents? Classical statistical decision theory arising in electrical engineering (and statistics) is model based: given a model, we wish to detect specific events in a dataset. (In non-parametric detection theory, the set of decision rules can be considered to be the model.) Here the goal is the reverse: given a dataset, we wish to determine if the actions of agents are consistent with utility maximization behavior, or more generally, consistent with play from a Nash equilibrium; and we then wish to estimate the associated utility function. Such problems can be studied using revealed preference methods arising in micro-economics. Classical revealed preferences deals with analyzing choices made by individuals. The celebrated "Afriat's theorem" [49], [4] provides a necessary and sufficient condition for a finite dataset to have originated from a utility maximizer. Specifically, revealed preferences [27], [14], [13], rational inattention, homophily [11], and social learning can be used to study multi-agent behavior in social networks, particularly YouTube.

REFERENCES

[1] D. Acemoglu and A. Ozdaglar. Opinion dynamics and learning in social networks.
Dynamic Games and Applications, 1(1):3–49, 2011.
[2] M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes — Theory and Applications. Information and System Sciences Series. Prentice Hall, New Jersey, USA, 1993.
[3] S. Bikchandani, D. Hirshleifer, and I. Welch. A theory of fads, fashion, custom, and cultural change as information cascades. Journal of Political Economy, 100(5):992–1026, October 1992.
[4] R. Blundell. How revealing is revealed preference? Journal of the European Economic Association, 3(2-3):211–235, 2005.
[5] A. Caplin and M. Dean. Revealed preference, rational inattention, and costly information acquisition. The American Economic Review, 105(7):2183–2203, 2015.
[6] C. Chamley. Rational Herds: Economic Models of Social Learning. Cambridge University Press, 2004.
[7] A. Doucet, N. Gordon, and V. Krishnamurthy. Particle filters for state estimation of jump Markov linear systems. IEEE Transactions on Signal Processing, 49:613–624, 2001.
[8] J. N. Eagle. The optimal search for a moving target when the search path is constrained. Operations Research, 32:1107–1115, 1984.
[9] D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010.
[10] R. Evans, V. Krishnamurthy, and G. Nair. Networked sensor management and data rate control for tracking maneuvering targets. IEEE Transactions on Signal Processing, 53(6):1979–1991, June 2005.
[11] O. Gharehshiran, W. Hoiles, and V. Krishnamurthy. Detection of homophilic communities and coordination of interacting meta-agents: A game-theoretic viewpoint. IEEE Transactions on Signal and Information Processing over Networks, 2(1):84–101, 2016.
[12] N. Higham and L. Lin. On pth roots of stochastic matrices. Linear Algebra and its Applications, 435(3):448–463, 2011.
[13] W. Hoiles, V. Krishnamurthy, and A. Aprem. PAC algorithms for detecting Nash equilibrium play in social networks: From Twitter to energy markets. IEEE Access, 4:8147–8161, 2016.
[14] W. Hoiles, O. Namvar, V. Krishnamurthy, N. Dao, and H. Zhang. Adaptive caching in the YouTube content distribution network: A revealed preference game-theoretic learning approach. IEEE Transactions on Cognitive Communications and Networking, 1(1):71–85, 2015.
[15] L. Johnston and V. Krishnamurthy. Opportunistic file transfer over a fading channel - a POMDP search theory formulation with optimal threshold policies. IEEE Transactions on Wireless Commun., 5(2):394–405, Feb. 2006.
[16] V. Krishnamurthy. Algorithms for optimal scheduling and management of hidden Markov model sensors. IEEE Transactions on Signal Processing, 50(6):1382–1397, June 2002.
[17] V. Krishnamurthy. Bayesian sequential detection with phase-distributed change time and nonlinear penalty – a lattice programming POMDP approach. IEEE Transactions on Information Theory, 57(3):7096–7124, Oct. 2011.
[18] V. Krishnamurthy. Quickest detection POMDPs with social learning: Interaction of local and global decision makers. IEEE Transactions on Information Theory, 58(8):5563–5587, 2012.
[19] V. Krishnamurthy. How to schedule measurements of a noisy Markov chain in decision making? IEEE Transactions on Information Theory, 59(9):4440–4461, July 2013.
[20] V. Krishnamurthy. Partially Observed Markov Decision Processes. From Filtering to Controlled Sensing. Cambridge University Press, 2016.
[21] V. Krishnamurthy, A. Aprem, and S. Bhatt. Opportunistic advertisement scheduling in live social media: A multiple stopping time POMDP approach. ArXiv e-prints, Nov. 2016.
[22] V. Krishnamurthy and S. Bhatt. Sequential detection of market shocks with risk-averse CVaR social sensors. IEEE Journal Selected Topics in Signal Processing, 2016.
[23] V. Krishnamurthy, S. Bhatt, and T. Pedersen. Tracking infection diffusion in social networks: Filtering algorithms and threshold bounds. ArXiv e-prints, 2016.
[24] V. Krishnamurthy, R. Bitmead, M. Gevers, and E. Miehling. Sequential detection with mutual information stopping cost: Application in GMTI radar. IEEE Transactions on Signal Processing, 60(2):700–714, 2012.
[25] V. Krishnamurthy and D. Djonin. Structured threshold policies for dynamic sensor scheduling – a partially observed Markov decision process approach. IEEE Transactions on Signal Processing, 55(10):4938–4957, Oct. 2007.
[26] V. Krishnamurthy and D. V. Djonin. Optimal threshold policies for multivariate POMDPs in radar resource management. IEEE Transactions on Signal Processing, 57(10), 2009.
[27] V. Krishnamurthy and W. Hoiles. Online reputation and polling systems: Data incest, social learning and revealed preferences. IEEE Transactions on Computational Social Systems, 1(3):164–179, Jan. 2015.
[28] V. Krishnamurthy and U. Pareek. Myopic bounds for optimal policy of POMDPs: An extension of Lovejoy's structural results. Operations Research, 62(2):428–434, 2015.
[29] V. Krishnamurthy and H. V. Poor. Social learning and Bayesian games in multiagent signal processing: How do local and global decision makers interact? IEEE Signal Processing Magazine, 30(3):43–57, 2013.
[30] V. Krishnamurthy and H. V. Poor. A tutorial on interactive sensing in social networks. IEEE Transactions on Computational Social Systems, 1(1):3–21, March 2014.
[31] V. Krishnamurthy and C. Rojas. Reduced complexity HMM filtering with stochastic dominance bounds: A convex optimization approach. IEEE Transactions on Signal Processing, 62(23):6309–6322, 2014.
[32] V. Krishnamurthy and B. Wahlberg. POMDP multiarmed bandits – structural results. Mathematics of Operations Research, 34(2):287–302, May 2009.
[33] A. Logothetis and V. Krishnamurthy. Expectation maximization algorithms for MAP estimation of jump Markov linear systems. IEEE Transactions on Signal Processing, 47(8):2139–2156, August 1999.
[34] W. S. Lovejoy. On the convexity of policy regions in partially observed systems. Operations Research, 35(4):619–621, July-August 1987.
[35] W. S. Lovejoy. Some monotonicity results for partially observed Markov decision processes.
Operations Research , 35(5):736–743,Sept.-Oct. 1987.
January 3, 2017 DRAFT3 [36] I. MacPhee and B. Jordan. Optimal search for a moving target.
Probability in the Engineering and Information Sciences , 9:159–182,1995.[37] G. B. Moustakides. Optimal stopping times for detecting changes in distributions.
Annals of Statistics , 14:1379–1387, 1986.[38] T. Nakai. The problem of optimal stopping in a partially observable markov chain.
Journal of Optimization Theory and Applications ,45(3):425–442, 1985.[39] M. H. Ngo and V. Krishnamurthy. Optimality of threshold policies for transmission scheduling in correlated fading channels.
IEEETransactions on Communications , 57(8):2474–2483, 2009.[40] C. H. Papadimitriou and J.N. Tsitsiklis. The compexity of Markov decision processes.
Mathematics of Operations Research , 12(3):441–450,1987.[41] H. V. Poor. Quickest detection with exponential penalty for delay.
Annals of Stastistics , 26(6):2179–2205, 1998.[42] H. V. Poor and O. Hadjiliadis.
Quickest Detection . Cambridge University Press, 2008.[43] U. Rieder. Structural results for partially observed control models.
Methods and Models of Operations Research , 35(6):473–490, 1991.[44] U. Rieder and R. Zagst. Monotonicity and bounds for convex stochastic control models.
Mathematical Methods of Operations Research ,39(2):187–207, June 1994.[45] A. H. Sayed. Adaptation, learning, and optimization over networks.
Foundations and Trends in Machine Learning , 7(4–5):311–801, 2014.[46] A. N. Shiryaev. On optimum methods in quickest detection problems.
Theory of Probability and its Applications , 8(1):22–46, 1963.[47] C. Sims. Implications of rational inattention.
Journal of Monetary Economics , 50(3):665–690, 2003.[48] S. Singh and V. Krishnamurthy. Optimal access control for DS-CDMA cellular networks. In
International Conference on Acoustics,Speech, and Signal Processing , pages 2749–2752, Orlando, Fl., 2002.[49] H. Varian. The nonparametric approach to demand analysis.
Econometrica , 50(1):945–973, 1982.[50] P. Wang and P. Djuri´c. Distributed Bayesian estimation of linear models with unknown observation covariances.
IEEE Transactions onSignal Processing , 64(8):1962–1971, 2016.[51] B. Yakir, A. M. Krieger, and M. Pollak. Detecting a change in regression: First-order optimality.
Annals of Statistics , 27(6):1896–1913,1999., 27(6):1896–1913,1999.