Optimal stopping time on discounted semi-Markov processes
arXiv:[math.PR]

Fang Chen†, Xianping Guo‡, Zhong-Wei Liao§

Abstract:
This paper studies the optimal stopping time problem for semi-Markov processes (SMPs) under the discounted optimality criterion with unbounded cost rates. We introduce an explicit construction of equivalent semi-Markov decision processes (SMDPs). The equivalence is embodied in the value functions of the SMPs and SMDPs: every stopping time of the SMPs induces a policy of the SMDPs with the same value function, and vice versa. The existence of an optimal stopping time for the SMPs is proved via this equivalence. Next, we give the optimality equation for the value function and develop an effective iterative algorithm for computing it. Moreover, we show that optimal and ε-optimal stopping times can be characterized as hitting times of special sets. Finally, to illustrate the validity of our results, an example of a maintenance system is presented.

Key Words. optimal stopping time, semi-Markov processes, value function, semi-Markov decision processes, optimal policy, iterative algorithm
Mathematics Subject Classification.
1 Introduction

Optimal stopping theory is an important branch of the intersection of probability and control theory, which aims to find the optimal stopping time of stochastic systems according

*Funding:
This work was partly supported by the National Natural Science Foundation of China (No.11931018, 61773411, 11701588) and the Guangdong Basic and Applied Basic Research Foundation (No.2020B1515310021) † School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China. Email: [email protected] ‡ School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China. Email: [email protected] § corresponding author. South China Research Center for Applied Mathematics and InterdisciplinaryStudies, South China Normal University, Guangzhou 510631, China. Email: [email protected]
to a certain criterion. The optimal stopping theory of Markov processes has been widely used in finance, for example in the pricing of American options; see the monographs [2, 5, 17] and the references therein. For discrete-time Markov processes, Dochviri [6] established a correspondence between inhomogeneous and homogeneous Markov processes and solved for the value function and the ε-optimal stopping time; Nikolaev [15] proposed the multi-objective stopping time problem on Markov chains and proved the existence of ε-optimal stopping times. For continuous-time Markov processes, Zhitlukhin & Shiryaev [22] gave existence conditions for the optimal stopping time with unbounded reward functions; Bäuerle & Popp [1] studied stopping time problems under risk-sensitive criteria and obtained the optimality equation of the value function and an explicit expression for optimal stopping times; Ye [21] proved the existence of an optimal stopping time and gave a formula for the corresponding value function under the discount criterion.

It is well known that the sojourn time of a discrete-time Markov process is constant, while that of a continuous-time Markov process is exponentially distributed. In practical applications, however, the sojourn time may satisfy neither of these conditions. Semi-Markov processes (SMPs) relax the condition on the sojourn time, allowing it to follow a general distribution. This paper investigates the optimal stopping time of SMPs. As is well known, SMPs are a kind of general dynamic programming model with applications in many areas; see [2, 4, 8, 9, 10, 14, 13, 22] for instance. For the optimal stopping time of SMPs, to the best of our knowledge, there is only one relevant study, namely [3]. Boshuizen & Gouweleeuw [3] studied the optimal stopping time problem of SMPs under the discount criterion and solved for the optimal stopping time by the dynamic programming method.
It is worth noting that most methods for optimal stopping problems are based on the martingale approach introduced by Snell [20]. In contrast, exploiting the particular structure of SMPs, we use the techniques introduced by Bäuerle & Rieder [2] for discrete-time Markov processes to study the optimal stopping time of SMPs. Intuitively, at each jump epoch of the SMP, the system has two options: continue or stop. Therefore, we can construct an action set with only two points, A = {0, 1}, corresponding to the actions of the decision maker, i.e. continue (a =
0) or stop (a = 1). We give the optimality equation of the value function and develop an effective iterative algorithm for computing it, see Theorem 4.2. Furthermore, through the optimality equation, we prove that the optimal stopping time can be characterized as the hitting time of a special set, see Theorem 4.3. For numerical calculation, we also give the concept of the ε-optimal stopping time and a sufficient condition ensuring that the ε-optimal stopping time is optimal, see Theorem 4.5. The significance of this result is that we can replace the value function with a computable approximate function, such that the ε-optimal stopping time coincides with the optimal one.

The rest of our paper is organized as follows. In Section 2, we describe the optimal stopping problem of SMPs and then give the regularity condition (Assumption 2.1). In Section 3, we present the explicit construction of the SMDPs, which are equivalent to the optimal stopping problems of SMPs. The equivalences between stopping times of SMPs and policies of SMDPs are introduced in Proposition 3.6 and Theorem 3.7. In Section 4, according to these equivalences, the optimality equation and the iterative algorithm for the value function are given in Theorem 4.2. Moreover, we obtain an explicit expression for the optimal stopping time, see Theorem 4.3. For numerical calculation, we give the concept of ε-optimal stopping times and prove that, under some conditions, the ε-optimal stopping time coincides with the optimal one of SMPs. Finally, we illustrate the validity of our results by an example of a maintenance system in Section 5.

2 The optimal stopping time model
This paper studies the model of SMPs on a denumerable state space S with transition mechanism Q(t, j|i). Here and in what follows, Q(t, j|i) is always assumed to satisfy:

(i) for any t ∈ [0, ∞), Q(t, ·|·) is a sub-stochastic kernel on S given S;

(ii) for any i, j ∈ S, Q(·, j|i) is a non-decreasing, right-continuous real-valued function on [0, ∞) satisfying Q(0, j|i) = 0, and P(·|·) := lim_{t→∞} Q(t, ·|·) is a stochastic kernel on S given S.

The evolution of the model is given as follows. At the beginning t =
0, the system occupies the state i_0 ∈ S. Subsequently, the system remains in i_0 for a time t_1 and then jumps to i_1 ∈ S, governed by the kernel Q(t_1, i_1|i_0). To describe the history (or trajectory, pathway) of the SMP, we introduce the measurable space (Ω, B(Ω)) based on the Kitaev construction (see [11, 12]),

  Ω = {(i_0, t_1, i_1, . . . , t_n, i_n, . . .) : i_0, i_n ∈ S, t_n ∈ [0, ∞), n ≥ 1},

and B(Ω) is the corresponding Borel σ-algebra. An element ω ∈ Ω is known as a pathway of the system. The histories of the SMP up to the n-th jump epoch are

  h_0 = i_0,  h_{n+1} = (i_0, t_1, i_1, . . . , t_{n+1}, i_{n+1}),  n ≥ 0.  (2.1)

Denote by H_n the set of all histories h_n up to the n-th jump epoch, endowed with the Borel σ-algebra B(H_n). For each n ≥ 0 and ω = (i_0, t_1, i_1, . . . , t_n, i_n, . . .) ∈ Ω, define

  X_n(ω) = i_n,  T_0(ω) = 0,  T_{n+1}(ω) = t_{n+1},

where X_n denotes the state at the n-th jump epoch and T_{n+1} denotes the sojourn time between the n-th and (n+1)-th jump epochs. Moreover, define

  H_n(ω) := (X_0(ω), T_1(ω), X_1(ω), . . . , T_n(ω), X_n(ω)).  (2.2)

In what follows, the argument ω is always omitted except in some special statements. For each n ≥
0, denote by F_n = σ(H_n) the natural filtration of the SMP. For each i ∈ S, by the Ionescu-Tulcea theorem (see [7, Proposition C.10]), there exists a unique probability measure P_i on (Ω, B(Ω)) such that, for each t ∈ [0, ∞), j ∈ S and h_n = (i_0, t_1, i_1, . . . , t_n, i_n) ∈ H_n (n ≥
0), it holds that

  P_i(T_0 = 0, X_0 = i) = 1,  (2.3)
  P_i(T_{n+1} ≤ t, X_{n+1} = j | H_n = h_n) = Q(t, j|i_n).  (2.4)

Hence, (Ω, B(Ω), P_i) becomes the probability space of the SMP, equipped with the natural filtration {F_n, n ≥ 0}. Denote by E_i the expectation with respect to P_i. Given any n ≥
0, we define the n-th jump epoch as

  S_n = Σ_{k=0}^{n} T_k.

To ensure the regularity of the SMP, we give the following assumption.
Assumption 2.1.
There exist constants δ > 0 and ǫ > 0 such that

  Σ_{j∈S} Q(δ, j|i) ≤ 1 − ǫ,  ∀ i ∈ S.  (2.5)

Assumption 2.1 is a standard regularity condition widely used for SMPs and SMDPs; see [8, 9, 16, 18], for instance. According to [9], Assumption 2.1 implies that

  P_i( lim_{n→∞} S_n = ∞ ) = 1,  ∀ i ∈ S.

Corresponding to the process {(T_n, X_n), n ≥ 0}, we define an underlying continuous-time state process {X(t), t ∈ [0, ∞)} by

  X(t) = X_n,  S_n ≤ t < S_{n+1}.  (2.6)

We refer to Limnios and Oprisan [14] for more details about the construction of {X(t), t ∈ [0, ∞)} and the properties given in (2.3) and (2.4).

Next, we introduce the optimal stopping time problem of SMPs. A mapping τ : Ω → N ∪ {+∞} is called an F_n-stopping time if, for each n ≥
0, it holds that

  {ω ∈ Ω : τ(ω) = n} ∈ F_n.  (2.7)

Denote by Γ the set of all F_n-stopping times. Where no ambiguity arises, we write F_n-stopping time simply as stopping time. Let c(i) and g(i) be nonnegative real-valued functions on S, which represent the cost rate function and the terminal cost function, respectively. Fix any discount factor β >
0. The infinite horizon discounted cost is defined as

  R_τ := ∫_0^{S_τ} e^{−βt} c(X(t)) dt + g(X(S_τ)) e^{−βS_τ},  if τ < +∞;
  R_τ := ∫_0^{+∞} e^{−βt} c(X(t)) dt,  if τ = +∞.

The infinite horizon expected discounted cost of a stopping time τ ∈ Γ is given by

  V_τ(i) := E_i[R_τ],  i ∈ S.  (2.8)

Definition 2.2.
The function V*(i) := inf_{τ∈Γ} V_τ(i) is called the value function (or minimum expected discounted cost) of the SMP. A stopping time τ* ∈ Γ is called optimal if it achieves the infimum, i.e.

  V_{τ*}(i) = V*(i) = inf_{τ∈Γ} V_τ(i),  for all i ∈ S.
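As a sanity check on definition (2.8), the expected discounted cost of a given stopping time can be estimated by Monte Carlo simulation. The sketch below assumes a small hypothetical SMP with exponential sojourn times and a hitting-time stopping rule; every numerical parameter (rates `lam`, kernel `P`, costs `c`, `g`, discount `beta`) is invented for illustration and is not from the paper.

```python
import math
import random

# Hypothetical 3-state SMP: exponential sojourn times with rates lam,
# embedded jump chain P, cost rate c, terminal cost g, discount beta.
random.seed(0)
beta = 0.5
lam = [1.0, 2.0, 1.5]                      # sojourn rate in each state
P = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.4, 0.6, 0.0]]                      # embedded transition kernel P(j|i)
c = [1.0, 2.0, 4.0]                        # running cost rate c(i)
g = [3.0, 3.0, 0.5]                        # terminal cost g(i)

def sample_R_tau(i0, stop_set, horizon=200):
    """One sample of the discounted cost R_tau for the hitting-time
    stopping rule tau = inf{n : X_n in stop_set}."""
    i, S_n, R = i0, 0.0, 0.0
    for _ in range(horizon):
        if i in stop_set:                  # tau = n: pay terminal cost
            return R + g[i] * math.exp(-beta * S_n)
        t = random.expovariate(lam[i])     # sojourn time in state i
        # integral of e^{-beta s} c(i) over [S_n, S_n + t]
        R += c[i] * (math.exp(-beta * S_n) - math.exp(-beta * (S_n + t))) / beta
        S_n += t
        i = random.choices(range(3), weights=P[i])[0]
    return R                               # discounted tail is negligible

def V_tau(i0, stop_set, n_samples=20000):
    """Monte Carlo estimate of V_tau(i0) = E_i0[R_tau], as in (2.8)."""
    return sum(sample_R_tau(i0, stop_set) for _ in range(n_samples)) / n_samples

v = V_tau(0, stop_set={2})
```

Since c and g are bounded here, any such estimate must lie below the crude bound ||c||/β + ||g||; starting inside the stopping set returns g(i) exactly.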
The main purpose of this paper is to find an optimal stopping time and give an algorithm for computing the value function V*(i).

3 The equivalent semi-Markov decision processes

In this section, we introduce the equivalent SMDPs corresponding to the original optimal stopping problem of SMPs. Intuitively, in the SMP, stopping or continuing can be considered as a special action in the corresponding SMDP. This intuition gives us an idea for constructing the SMDPs.

The details of the construction of the SMDPs are given as follows. Here and in what follows, we always use "ˆ·" to distinguish the corresponding SMDPs from the original SMPs. The model of SMDPs is introduced by the four-tuple

  { Ŝ, (A(i) ⊂ A), Q̂(t, j|i, a), ĉ(i, a) }.  (3.1)

The state space Ŝ := S ∪ {∆} is a denumerable space consisting of the state space S of the SMPs and a virtual state ∆. The action space A(i), which denotes the set of admissible actions at state i ∈ Ŝ, is defined as

  A(i) := {0, 1},  i ∈ S;  A(i) := {1},  i = ∆,

so that A = ∪_{i∈Ŝ} A(i) = {0, 1} is finite. Denote by K := {(i, a) : i ∈ Ŝ, a ∈ A(i)} the set of feasible state-action pairs. The semi-Markov kernel Q̂(t, j|i, a) of the SMDPs is given by

  Q̂(t, j|i, a) := Q(t, j|i),  if i ∈ S, j ∈ S, a = 0;
  Q̂(t, j|i, a) := 1_{[1,+∞)}(t),  if i ∈ Ŝ, j = ∆, a = 1;
  Q̂(t, j|i, a) := 0,  otherwise,  (3.2)

where Q(t, j|i) is the kernel of the SMPs and 1_E is the indicator function of the set E. Finally, the cost rate function of the SMDPs is defined as

  ĉ(i, a) := c(i),  if i ∈ S, a = 0;
  ĉ(i, a) := βg(i)/(1 − e^{−β}),  if i ∈ S, a = 1;
  ĉ(i, a) := 0,  if i = ∆, a = 1,  (3.3)

where c(i) is the cost rate function and g(i) is the terminal cost function of the SMPs.

The definitions of the history of SMDPs and of history-dependent policies are exactly the same as in [8, 9], but for ease of reading we repeat them here. The trajectory space of the SMDPs is defined as

  Ω̂ = {(i_0, a_0, t_1, i_1, a_1, . . . , t_n, i_n, a_n, . . .
) : t_{m+1} ∈ [0, +∞) and (i_m, a_m) ∈ K for m ≥ 0},

which is equipped with the Borel σ-algebra B(Ω̂). Moreover, the histories of the SMDPs up to the n-th jump epoch have the form

  ĥ_0 = i_0,  ĥ_{n+1} = (i_0, a_0, t_1, i_1, . . . , a_n, t_{n+1}, i_{n+1}),  n ≥ 0.  (3.4)

Denote by Ĥ_n the set of all histories ĥ_n, endowed with the Borel σ-algebra B(Ĥ_n). For each ω̂ = (i_0, a_0, t_1, i_1, a_1, . . . , t_n, i_n, a_n, . . .) ∈ Ω̂, let

  X̂_n(ω̂) = i_n,  Â_n(ω̂) = a_n,  T̂_0(ω̂) = 0,  T̂_{n+1}(ω̂) = t_{n+1},  Ŝ_n(ω̂) := Σ_{k=0}^{n} T̂_k(ω̂),  ∀ n ≥ 0,

and Ĥ_n = (X̂_0, Â_0, T̂_1, X̂_1, Â_1, . . . , T̂_n, X̂_n). Similar to (2.6), we define the continuous-time processes X̂(t), Â(t) by

  X̂(t) = X̂_n,  Â(t) = Â_n,  Ŝ_n ≤ t < Ŝ_{n+1}.  (3.5)

The definition of a deterministic history-dependent policy of the SMDPs, which specifies a decision rule for selecting actions, is given below.

Definition 3.1. A deterministic history-dependent policy is a sequence π = {f_n, n ≥ 0} of measurable functions f_n : Ĥ_n → A satisfying

  f_n(ĥ_n) ∈ A(i_n),  ∀ ĥ_n = (i_0, a_0, t_1, i_1, . . . , a_{n−1}, t_n, i_n) ∈ Ĥ_n, n ≥ 0.

In particular, a deterministic history-dependent policy is called deterministic stationary if the f_n are independent of n. Write π = {f, f, . . .} as f for simplicity. Denote by Π_DH and Π_DS the sets of all deterministic history-dependent and deterministic stationary policies, respectively. For each i ∈ Ŝ, the expected discounted cost of a policy π ∈ Π_DH is defined as

  U_π(i) = Ê_i^π [ ∫_0^∞ e^{−βt} ĉ(X̂(t), Â(t)) dt ],

where Ê_i^π is the expectation depending on the state i and the policy π, which is guaranteed by the Ionescu-Tulcea theorem (see [7, Proposition C.10] or [8, Section 2]). The value function of the SMDPs is given by U*(i) = inf_{π∈Π_DH} U_π(i).

Next, we focus on the relationship between the stopping times of the SMPs and the policies of the SMDPs.
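The construction (3.1)-(3.3) is mechanical and can be sketched directly in code. The sketch below assumes a hypothetical 3-state SMP whose kernel factors as Q(t, j|i) = (1 − e^{−λ_i t}) P(j|i), i.e. exponential sojourn times; all numerical parameters are invented for illustration.

```python
import math

# Hypothetical SMP data (illustrative assumptions, not from the paper).
beta = 0.5
lam = [1.0, 2.0, 1.5]
P = [[0.0, 0.7, 0.3], [0.5, 0.0, 0.5], [0.4, 0.6, 0.0]]
c = [1.0, 2.0, 4.0]
g = [3.0, 3.0, 0.5]

DELTA = "Delta"                            # the virtual absorbing state
S_hat = [0, 1, 2, DELTA]                   # state space S-hat = S + {Delta}

def A(i):                                  # admissible actions; A(Delta) = {1}
    return (0, 1) if i != DELTA else (1,)

def Q_hat(t, j, i, a):
    """Semi-Markov kernel (3.2): 'continue' (a = 0) uses the original kernel
    Q(t, j|i) = (1 - e^{-lam_i t}) P(j|i); 'stop' (a = 1) jumps to Delta
    after one unit of time with probability one."""
    if i != DELTA and j != DELTA and a == 0:
        return (1.0 - math.exp(-lam[i] * t)) * P[i][j]
    if j == DELTA and a == 1:
        return 1.0 if t >= 1.0 else 0.0
    return 0.0

def c_hat(i, a):
    """Cost rate (3.3): continuing costs c(i); stopping charges the terminal
    cost g(i) spread over the unit sojourn; Delta is free."""
    if i != DELTA and a == 0:
        return c[i]
    if i != DELTA and a == 1:
        return beta * g[i] / (1.0 - math.exp(-beta))
    return 0.0                             # c_hat(Delta, 1) = 0

K = [(i, a) for i in S_hat for a in A(i)]  # feasible state-action pairs
```

The scaling βg(i)/(1 − e^{−β}) is chosen exactly so that the discounted cost accumulated over the unit sojourn before absorption in ∆ equals g(i).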
For each history of the SMP up to the n-th jump epoch, h_n ∈ H_n as given in (2.1), we define a map M_n by

  M_n(h_n) = (i_0, 0, t_1, i_1, 0, . . . , t_n, i_n) ∈ Ĥ_n.  (3.6)

The action "0" (meaning continuation) inserted in (3.6) indicates that the system has been running without interruption until the n-th jump epoch. Obviously, M_n(H_n) := {M_n(h_n) : h_n ∈ H_n} ∈ B(Ĥ_n). More generally, define M_n(C) := {M_n(h_n) : h_n ∈ C} ∈ B(Ĥ_n) for each subset C ∈ B(H_n). For each stopping time τ ∈ Γ, we can introduce a policy π_τ in the following way.

Definition 3.2.
Given any stopping time τ ∈ Γ and n ≥ 0, let

  B_n^τ := { H_n(ω) : ω ∈ Ω, τ(ω) = n },  (3.7)

where H_n is given in (2.2). For each history ĥ_n = (i_0, a_0, t_1, i_1, . . . , a_{n−1}, t_n, i_n) ∈ Ĥ_n, define

  f_n^τ(ĥ_n) := 1_{B_n^τ}(i_0, t_1, i_1, . . . , t_n, i_n),  if ĥ_n ∈ M_n(H_n);
  f_n^τ(ĥ_n) := 1,  if ĥ_n ∈ Ĥ_n \ M_n(H_n).

The policy π_τ := {f_n^τ, n ≥ 0} is called the policy induced by the stopping time τ.

Lemma 3.3. For each stopping time τ ∈ Γ, the induced policy π_τ is a deterministic history-dependent policy of the corresponding SMDPs.

Proof. By Definition 3.2, it holds that f_n^τ(ĥ_n) ∈ A(i_n). Then we need only consider measurability. Noting that B_n^τ ∈ B(H_n), we have

  {ĥ_n ∈ Ĥ_n : f_n^τ(ĥ_n) = 0} = M_n(H_n) ∩ M_n((B_n^τ)^c) ∈ B(Ĥ_n)

and

  {ĥ_n ∈ Ĥ_n : f_n^τ(ĥ_n) = 1} = Ĥ_n \ {ĥ_n ∈ Ĥ_n : f_n^τ(ĥ_n) = 0} ∈ B(Ĥ_n).

Hence, the policy π_τ := {f_n^τ, n ≥ 0} is a deterministic history-dependent policy of the corresponding SMDPs. □

Definition 3.2 and Lemma 3.3 show that, given any stopping time of the SMP, we can construct a history-dependent policy of the SMDPs. Next, conversely, given any history-dependent policy of the SMDPs, we construct a corresponding stopping time of the SMP, see Definition 3.4 and Lemma 3.5 below.
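Definition 3.2 can be sketched in code for the common case where τ is the hitting time of a set of states, so that the event {τ = n} is decided by the states appearing in the history. Histories follow the layouts (2.1) and (3.4); the example set and histories below are hypothetical.

```python
def make_tau(stop_set):
    """tau as a function of SMP histories h = (i_0, t_1, i_1, ..., t_n, i_n)
    from (2.1), for the hitting rule tau = inf{n : i_n in stop_set}."""
    def tau_of_history(h):
        states = h[0::2]                   # i_0, i_1, ..., i_n
        for n, s in enumerate(states):
            if s in stop_set:
                return n
        return None                        # not yet stopped along this history
    return tau_of_history

def f_tau(tau, h_hat):
    """Induced decision rule f_n^tau of Definition 3.2; h_hat is the SMDP
    history (i_0, a_0, t_1, i_1, ..., a_{n-1}, t_n, i_n) from (3.4)."""
    states, actions, times = h_hat[0::3], h_hat[1::3], h_hat[2::3]
    n = len(states) - 1
    if any(a != 0 for a in actions):
        return 1                           # h_hat outside M_n(H_n): action 1
    # h_hat = M_n(h_n): stop (1) iff h_n lies in B_n^tau, i.e. tau(h_n) = n
    h = [states[0]]
    for t, s in zip(times, states[1:]):
        h += [t, s]
    return 1 if tau(h) == n else 0

tau = make_tau({2})                        # hypothetical stopping set {2}
```

For a history whose last state first enters the set, the induced rule returns 1 (stop); on any strict prefix it returns 0 (continue), and off M_n(H_n) it is fixed to 1.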
Definition 3.4.
Given any deterministic history-dependent policy π = {f_n, n ≥ 0}, for each ω ∈ Ω we define

  τ_π(ω) := inf{ n ∈ N : f_n(M_n(H_n(ω))) = 1 },

where inf{∅} := +∞ and M_n is given in (3.6). Then τ_π is called the stopping time induced by the policy π.

Lemma 3.5.
For each deterministic history-dependent policy π = {f_n, n ≥ 0} of the SMDPs, the induced mapping τ_π is indeed a stopping time.

Proof. Note that for each n ≥
0, the random variable H_n = (X_0, T_1, X_1, . . . , T_n, X_n), the mapping M_n and the function f_n are measurable on their corresponding spaces. Hence, we have

  {τ_π = n} = ∩_{k=0}^{n−1} {f_k(M_k(H_k)) = 0} ∩ {f_n(M_n(H_n)) = 1} ∈ F_n,

which implies that τ_π is an F_n-stopping time. □

These results suggest that the stopping times of the SMPs and the policies of the SMDPs are in one-to-one correspondence. The following proposition verifies this idea.

Proposition 3.6.
Given any stopping time τ ∈ Γ of the SMP, it holds that

  τ = τ_{π_τ},  (3.8)

where π_τ = {f_n^τ, n ≥ 0} is the policy induced by τ and τ_{π_τ} is the stopping time induced by π_τ.

Proof. Fix any n ≥
0. For each ω ∈ Ω, we have

  f_n^τ(M_n(H_n(ω))) = 1_{B_n^τ}(H_n(ω)) = 1_{τ=n}(ω).  (3.9)

Hence, by Definition 3.4, it holds that

  τ_{π_τ}(ω) = inf{ n ∈ N : f_n^τ(M_n(H_n(ω))) = 1 } = inf{ n ∈ N : 1_{τ=n}(ω) = 1 }.

Thus, we obtain

  {τ_{π_τ} = n} = { ω ∈ Ω : 1_{τ=k}(ω) = 0 for 0 ≤ k ≤ n−1, 1_{τ=n}(ω) = 1 } = {τ = n}.

By the arbitrariness of n ≥
0, we obtain τ = τ_{π_τ}. □

In short, Proposition 3.6 ensures that the correspondence between the stopping times of the SMPs and the policies of the SMDPs is one-to-one. Moreover, for any stopping time τ, the infinite horizon expected discounted cost of τ equals the expected discounted cost of the corresponding policy π_τ, see Theorem 3.7 below.

Theorem 3.7.
Given any τ ∈ Γ, let π_τ = {f_n^τ, n ≥ 0} be the policy induced by τ. Then we have

  V_τ(i) = U_{π_τ}(i),  ∀ i ∈ S.  (3.10)

Proof. For each ω̂ = (i_0, a_0, t_1, . . . , i_n, a_n, t_{n+1}, . . .) ∈ Ω̂, Ĥ_n(ω̂) = (i_0, a_0, t_1, i_1, . . . , a_{n−1}, t_n, i_n) is the history of the SMDPs up to the n-th jump epoch. Denote by {C_n, n ≥ 0} and C the subsets of Ω̂ defined as

  C_n := { ω̂ ∈ Ω̂ : inf{ k ∈ N : f_k^τ(Ĥ_k(ω̂)) = 1 } = n },  n ≥ 0,
  C := { ω̂ ∈ Ω̂ : f_k^τ(Ĥ_k(ω̂)) = 0 for all k ≥ 0 }.

It is easy to see that {C, C_n, n ≥ 0} is a partition of Ω̂. Then, using the monotone convergence theorem, we obtain

  U_{π_τ}(i) = Σ_{k=0}^{+∞} Ê_i^{π_τ}[ 1_{C_k} ∫_0^∞ e^{−βt} ĉ(X̂(t), Â(t)) dt ] + Ê_i^{π_τ}[ 1_C ∫_0^∞ e^{−βt} ĉ(X̂(t), Â(t)) dt ].  (3.11)

To calculate the first term of (3.11), for each k ≥
0, we use

  Ê_i^{π_τ}[ 1_{C_k} ∫_0^∞ e^{−βt} ĉ(X̂(t), Â(t)) dt ]
  = Ê_i^{π_τ}[ 1_{C_k} Σ_{m=0}^{+∞} ∫_{Ŝ_m}^{Ŝ_{m+1}} e^{−βt} ĉ(X̂(t), Â(t)) dt ]
  = Σ_{m=0}^{+∞} (1/β) Ê_i^{π_τ}[ 1_{C_k} (e^{−βŜ_m} − e^{−βŜ_{m+1}}) ĉ(X̂_m, Â_m) ]
  = Σ_{m=0}^{k} (1/β) Ê_i^{π_τ}[ 1_{C_k} (e^{−βŜ_m} − e^{−βŜ_{m+1}}) ĉ(X̂_m, Â_m) ].  (3.12)

The third equality of (3.12) is based on the definition of Q̂ given in (3.2). In fact, the semi-Markov kernel Q̂ satisfies Q̂(t, ∆|i, 1) = 1 for t ≥ 1 and i ∈ Ŝ, and the action set satisfies A(∆) = {1}. These mean that once action 1 is selected, the process jumps to state ∆ with probability one after one unit of time, and then stays in ∆ forever, i.e.

  Â_k(ω̂) = 1,  Â_l(ω̂) = 1,  X̂_l(ω̂) = ∆,  for ω̂ ∈ C_k and l ≥ k + 1.  (3.13)

Since ĉ(∆, 1) =
0, it holds that 1_{C_k} ĉ(X̂_l, Â_l) = 0 for all l ≥ k +
1. In the next step, our goal is to show that

  Ê_i^{π_τ}[ 1_{C_k} (e^{−βŜ_m} − e^{−βŜ_{m+1}}) ĉ(X̂_m, Â_m) ]
    = E_i[ 1_{τ=k} (e^{−βS_m} − e^{−βS_{m+1}}) c(X_m) ],  if 0 ≤ m < k;
    = E_i[ β 1_{τ=k} e^{−βS_k} g(X_k) ],  if m = k.  (3.14)

To begin with, let us discuss the relationship between C_k and {τ = k}. By the definition of C_k, ω̂ ∈ C_k if and only if f_m^τ(Ĥ_m(ω̂)) = 0 for m < k and f_k^τ(Ĥ_k(ω̂)) =
1. Conversely, according to Definition 3.4 and Proposition 3.6, we have

  τ(ω) = τ_{π_τ}(ω) = inf{ n ∈ N : f_n^τ(M_n(H_n(ω))) = 1 },

which means that 1_{τ=k} = Π_{m=0}^{k−1} (1 − f_m^τ(M_m(H_m))) × f_k^τ(M_k(H_k)). For each 0 ≤ m < k, we have

  Ê_i^{π_τ}[ 1_{C_k} (e^{−βŜ_m} − e^{−βŜ_{m+1}}) ĉ(X̂_m, Â_m) ]
  = Σ_{i_0∈S} δ_i(i_0) Σ_{i_1∈Ŝ} ∫_0^∞ Q̂(dt_1, i_1|i_0, 0) · · · Σ_{i_k∈Ŝ} ∫_0^∞ Q̂(dt_k, i_k|i_{k−1}, 0) ĉ(i_m, 0)
    × [ e^{−β Σ_{n=1}^{m} t_n} − e^{−β Σ_{n=1}^{m+1} t_n} ] Π_{n=0}^{k−1} (1 − f_n^τ(i_0, 0, t_1, . . . , 0, t_n, i_n)) × f_k^τ(i_0, 0, t_1, . . . , 0, t_k, i_k),  (3.15)

where Σ_{n=1}^{0} t_n :=
0. Then, by the definitions of Q̂, ĉ, and f_n^τ given in (3.2), (3.3), and Definition 3.2, we obtain

  Ê_i^{π_τ}[ 1_{C_k} (e^{−βŜ_m} − e^{−βŜ_{m+1}}) ĉ(X̂_m, Â_m) ]
  = Σ_{i_0∈S} δ_i(i_0) Σ_{i_1∈S} ∫_0^∞ Q(dt_1, i_1|i_0) · · · Σ_{i_k∈S} ∫_0^∞ Q(dt_k, i_k|i_{k−1}) [ e^{−β Σ_{n=1}^{m} t_n} − e^{−β Σ_{n=1}^{m+1} t_n} ]
    × c(i_m) Π_{n=0}^{k−1} (1 − f_n^τ(M_n(i_0, t_1, i_1, . . . , t_n, i_n))) × f_k^τ(M_k(i_0, t_1, i_1, . . . , t_k, i_k))
  = E_i[ 1_{τ=k} c(X_m) (e^{−βS_m} − e^{−βS_{m+1}}) ].  (3.16)

In the same way, we can verify (3.14) in the case m = k. The only thing to be careful about is that Q̂(t, ∆|i, 1) = 1_{[1,+∞)}(t) for all i ∈ Ŝ; this means that the system occupies the state i for one unit of time if action 1 is selected at state i. Since ĉ(i, 1) = βg(i)/(1 − e^{−β}), we have

  Ê_i^{π_τ}[ 1_{C_k} (e^{−βŜ_k} − e^{−βŜ_{k+1}}) ĉ(X̂_k, Â_k) ]
  = Σ_{i_0∈Ŝ} δ_i(i_0) Σ_{i_1∈Ŝ} ∫_0^∞ Q̂(dt_1, i_1|i_0, 0) · · · Σ_{i_k∈Ŝ} ∫_0^∞ Q̂(dt_k, i_k|i_{k−1}, 0) (βg(i_k)/(1 − e^{−β}))
    × e^{−β Σ_{n=1}^{k} t_n} (1 − e^{−β}) Π_{n=0}^{k−1} (1 − f_n^τ(i_0, 0, t_1, . . . , 0, t_n, i_n)) f_k^τ(i_0, 0, t_1, . . . , 0, t_k, i_k)
  = β Σ_{i_0∈S} δ_i(i_0) Σ_{i_1∈S} ∫_0^∞ Q(dt_1, i_1|i_0) · · · Σ_{i_k∈S} ∫_0^∞ Q(dt_k, i_k|i_{k−1}) e^{−β Σ_{n=1}^{k} t_n} g(i_k)
    × Π_{n=0}^{k−1} (1 − f_n^τ(M_n(i_0, t_1, . . . , t_n, i_n))) × f_k^τ(M_k(i_0, t_1, . . . , t_k, i_k))
  = E_i[ β 1_{τ=k} e^{−βS_k} g(X_k) ].  (3.17)

Hence, (3.16) and (3.17) imply that (3.14) holds. Moreover, by (3.12) and (3.14), we have

  Ê_i^{π_τ}[ 1_{C_k} ∫_0^∞ e^{−βt} ĉ(X̂(t), Â(t)) dt ]
  = Σ_{m=0}^{k} (1/β) Ê_i^{π_τ}[ 1_{C_k} (e^{−βŜ_m} − e^{−βŜ_{m+1}}) ĉ(X̂_m, Â_m) ]
  = Σ_{m=0}^{k−1} (1/β) E_i[ 1_{τ=k} (e^{−βS_m} − e^{−βS_{m+1}}) c(X_m) ] + E_i[ 1_{τ=k} e^{−βS_k} g(X_k) ]
  = E_i[ 1_{τ=k} R_τ ].  (3.18)

Next, we calculate the second term of (3.11).
Noting that {C, C_n, n ≥ 0} is a partition of Ω̂, for all k ≥ 0,

  1_C = 1 − Σ_{n=0}^{+∞} 1_{C_n} = Π_{m=0}^{k} (1 − 1_{C_m}) − Σ_{n=k+1}^{+∞} 1_{C_n},

which implies that

  Ê_i^{π_τ}[ 1_C ∫_0^∞ e^{−βt} ĉ(X̂(t), Â(t)) dt ]
  = Σ_{k=0}^{+∞} (1/β) { Ê_i^{π_τ}[ Π_{m=0}^{k} (1 − 1_{C_m}) ĉ(X̂_k, Â_k)(e^{−βŜ_k} − e^{−βŜ_{k+1}}) ]
    − Σ_{n=k+1}^{+∞} Ê_i^{π_τ}[ 1_{C_n} ĉ(X̂_k, Â_k)(e^{−βŜ_k} − e^{−βŜ_{k+1}}) ] }.  (3.19)

Similarly to (3.19), we decompose E_i[1_{τ=+∞} R_τ] as

  E_i[ 1_{τ=+∞} R_τ ]
  = Σ_{k=0}^{+∞} (1/β) { E_i[ Π_{m=0}^{k} (1 − 1_{τ=m}) c(X_k)(e^{−βS_k} − e^{−βS_{k+1}}) ]
    − Σ_{n=k+1}^{+∞} E_i[ 1_{τ=n} c(X_k)(e^{−βS_k} − e^{−βS_{k+1}}) ] }.  (3.20)

For the first term of (3.20), similarly to (3.16), it holds that

  E_i[ Π_{m=0}^{k} (1 − 1_{τ=m}) c(X_k)(e^{−βS_k} − e^{−βS_{k+1}}) ] = Ê_i^{π_τ}[ Π_{m=0}^{k} (1 − 1_{C_m}) ĉ(X̂_k, Â_k)(e^{−βŜ_k} − e^{−βŜ_{k+1}}) ].

Hence, together with (3.16), (3.19), and (3.20), it holds that

  E_i[ 1_{τ=+∞} R_τ ] = Ê_i^{π_τ}[ 1_C ∫_0^∞ e^{−βt} ĉ(X̂(t), Â(t)) dt ].  (3.21)

Finally, according to (3.11), (3.18) and (3.21), we obtain

  U_{π_τ}(i) = Σ_{k=0}^{+∞} E_i[ 1_{τ=k} R_τ ] + E_i[ 1_{τ=∞} R_τ ] = V_τ(i).

The proof of this theorem is complete. □

4 The iterative algorithm and optimal stopping times
The results of Section 3 state that for each stopping time τ of the SMP there exists an equivalent policy π_τ of the corresponding SMDPs. Hence, we can analyze the value function V* and the optimal stopping time τ* of the SMP through results for the corresponding SMDPs. To do so, we collect some results on the value function U* and the optimal policy of the SMDPs from [8]. Note that the regularity condition (Assumption 2.1) is needed.

Lemma 4.1.
Suppose that Assumption 2.1 holds. For the corresponding SMDPs defined in (3.1), the following statements hold.

(a) Let U*_{−1}(i) ≡ 0. For each n ≥ 0 and i ∈ Ŝ, define a sequence of functions U*_n on Ŝ as

  U*_n(i) = min_{a∈A(i)} { ĉ(i, a) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, a)) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, a) U*_{n−1}(j) }.  (4.1)

Then, for each i ∈ Ŝ, the U*_n(i) are non-decreasing and the value function of the SMDPs U*(i) satisfies U*(i) = lim_{n→∞} U*_n(i) for each i ∈ Ŝ. Moreover, it holds that

  U*(i) = min_{a∈A(i)} { ĉ(i, a) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, a)) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, a) U*(j) }.  (4.2)

(b) A deterministic stationary policy f ∈ Π_DS is optimal if and only if, for each i ∈ Ŝ,

  U*(i) = ĉ(i, f(i)) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, f(i))) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, f(i)) U*(j).  (4.3)

(c) There exists a deterministic stationary policy f* ∈ Π_DS satisfying (4.3), which means f* is optimal.

Proof. According to Assumption 2.1, there exist δ > 0 and ǫ > 0 such that for all i ∈ S, Σ_{j∈S} Q(δ, j|i) ≤ 1 − ǫ. Let δ* := min{δ, 1/2}. By the definition of Q̂ in (3.2), we have

  Σ_{j∈Ŝ} Q̂(δ*, j|i, a) = Σ_{j∈S} Q(δ*, j|i) ≤ Σ_{j∈S} Q(δ, j|i) ≤ 1 − ǫ,  if i ∈ S, a = 0;
  Σ_{j∈Ŝ} Q̂(δ*, j|i, a) = 1_{[1,+∞)}(δ*) = 0 ≤ 1 − ǫ,  if i ∈ Ŝ, a = 1,  (4.4)

which implies Σ_{j∈Ŝ} Q̂(δ*, j|i, a) ≤ 1 − ǫ for all (i, a) ∈ K. Hence, the regularity condition of [8] holds. Parts (a), (b) and (c) follow from [8, Theorem 4.1], [8, Theorem 3.1] and [8, Theorem 3.2], respectively. □

With these preparations in hand, we can give our main results on SMPs, namely the existence of an optimal stopping time of SMPs together with an iterative algorithm for computing the value function V*, see Theorem 4.2 below. To establish the algorithm, we define an operator T from M to M, where M denotes the set of all non-negative real-valued functions on S, i.e.
for each V ∈ M, TV(i) is defined by

  TV(i) := min{ g(i), c(i) ∫_0^∞ e^{−βt} (1 − Σ_{j∈S} Q(t, j|i)) dt + Σ_{j∈S} ∫_0^∞ e^{−βt} Q(dt, j|i) V(j) }.  (4.5)

From this expression, we can verify that T is a monotone operator, i.e. TV_1 ≤ TV_2 if V_1 ≤ V_2.

Theorem 4.2.
Suppose that Assumption 2.1 holds. Then the following statements hold.

(a) There exists an optimal stopping time of the SMP.

(b) Let V*_{−1}(i) ≡ 0 and V*_n(i) = TV*_{n−1}(i) for each n ≥ 0 and i ∈ S. Then, for each i ∈ S, the V*_n(i) are non-decreasing, and the value function of the SMP satisfies V*(i) = lim_{n→∞} V*_n(i) and is the solution to the following optimality equation:

  V*(i) = TV*(i),  ∀ i ∈ S.

Proof. (a) Under Assumption 2.1, Lemma 4.1 says that there exists a policy f* ∈ Π_DS ⊂ Π_DH such that for all i ∈ S, U*(i) = U_{f*}(i). Using Proposition 3.6 and Theorem 3.7, we have

  U*(i) = inf_{π∈Π_DH} U_π(i) = inf_{τ∈Γ} V_τ(i) ≤ V_{τ_{f*}}(i) = U_{f*}(i) = U*(i),  ∀ i ∈ S,  (4.6)

where τ_{f*} is the stopping time induced by the policy f*. Hence, we have

  V*(i) = inf_{τ∈Γ} V_τ(i) = V_{τ_{f*}}(i),  ∀ i ∈ S,

which means that τ_{f*} is an optimal stopping time of the SMP.

(b) When i = ∆ ∈ Ŝ, the functions U*_n(i) given in Lemma 4.1 satisfy

  U*_n(∆) = ĉ(∆, 1) ∫_0^∞ e^{−βt} (1 − Q̂(t, ∆|∆, 1)) dt + ∫_0^∞ e^{−βt} Q̂(dt, ∆|∆, 1) U*_{n−1}(∆),  n ≥ 0,

so that, with U*_{−1}(∆) = 0 and ĉ(∆, 1) = 0, we have U*_n(∆) = 0 for all n ≥
0. Hence, we only consider the case i ∈ S. By Lemma 4.1 and (4.6), we have

  V*(i) = U*(i) = lim_{n→+∞} U*_n(i),  ∀ i ∈ S.  (4.7)

Next, by induction, we aim to prove that

  U*_n(i) = V*_n(i),  ∀ i ∈ S and n ≥ −1.  (4.8)

Clearly, (4.8) holds for n = −
1. Assume that it holds for n = k −
1; we now consider the case n = k. Noting that A(i) = {0, 1} for each i ∈ S, the expression (4.1) implies that

  U*_k(i) = min_{a∈A(i)} { ĉ(i, a) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, a)) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, a) U*_{k−1}(j) }
  = min{ ĉ(i, 0) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, 0)) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, 0) U*_{k−1}(j),
         ĉ(i, 1) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, 1)) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, 1) U*_{k−1}(j) }.  (4.9)

For the second term of (4.9), by the definitions of Q̂ in (3.2) and ĉ in (3.3), we have

  ĉ(i, 1) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, 1)) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, 1) U*_{k−1}(j)
  = ĉ(i, 1) ∫_0^∞ e^{−βt} (1 − Q̂(t, ∆|i, 1)) dt + ∫_0^∞ e^{−βt} Q̂(dt, ∆|i, 1) U*_{k−1}(∆)
  = (βg(i)/(1 − e^{−β})) ∫_0^1 e^{−βt} dt = g(i).

Hence,

  U*_k(i) = min{ ĉ(i, 0) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, 0)) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, 0) U*_{k−1}(j), g(i) }
  = min{ c(i) ∫_0^∞ e^{−βt} (1 − Σ_{j∈S} Q(t, j|i)) dt + Σ_{j∈S} ∫_0^∞ e^{−βt} Q(dt, j|i) V*_{k−1}(j), g(i) }
  = TV*_{k−1}(i) = V*_k(i).

Hence (4.8) holds, so part (b) of this theorem holds, and we complete the proof. □

From the perspective of practical application, the mere existence of an optimal stopping time of SMPs is not enough. We would rather have a computable characterization of the optimal stopping time. In the following theorem, we give an optimal stopping time, which is equivalent to the hitting time of a special set.
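The iteration V_n = TV_{n−1} of Theorem 4.2(b) can be sketched numerically. The sketch below assumes a hypothetical SMP whose kernel factors as Q(t, j|i) = (1 − e^{−λ_i t}) P(j|i), so the two integrals in (4.5) have closed forms, 1/(β + λ_i) and (λ_i/(β + λ_i)) Σ_j P(j|i) V(j); all numerical parameters are invented for illustration. The states where the minimum in (4.5) is attained by g(i) form the special stopping set described in the next theorem.

```python
# Hypothetical exponential-sojourn SMP (illustrative assumptions).
beta = 0.5
lam = [1.0, 2.0, 1.5]
P = [[0.0, 0.7, 0.3], [0.5, 0.0, 0.5], [0.4, 0.6, 0.0]]
c = [1.0, 2.0, 4.0]
g = [3.0, 3.0, 0.5]

def T(V):
    """One application of the operator T in (4.5), using the closed-form
    integrals valid for exponential sojourn times."""
    return [min(g[i],
                c[i] / (beta + lam[i])
                + lam[i] / (beta + lam[i]) * sum(P[i][j] * V[j] for j in range(3)))
            for i in range(3)]

V = [0.0, 0.0, 0.0]                        # V_{-1} = 0
for _ in range(500):                       # non-decreasing iterates V_n -> V*
    V_new = T(V)
    if max(abs(a - b) for a, b in zip(V, V_new)) < 1e-12:
        break
    V = V_new

# Stopping region: states where g(i) attains the minimum in T V*(i).
S_star = {i for i in range(3) if abs(g[i] - T(V)[i]) < 1e-9}
```

With these invented numbers the iteration converges geometrically (the discount factor per jump is λ_i/(β + λ_i) < 1), and only the expensive state with cheap terminal cost ends up in the stopping region.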
Theorem 4.3.
Suppose that Assumption 2.1 holds. Let V*(i) be the value function of the SMP. Define a subset of S by

  S* := { i ∈ S : g(i) = TV*(i) }.  (4.10)

Then for each ω = (i_0, t_1, . . . , i_n, t_{n+1}, . . .) ∈ Ω, the optimal stopping time is equal to the hitting time of S*, that is,

  τ*(ω) = +∞,  if S* = ∅;
  τ*(ω) = inf{ n ∈ N : i_n ∈ S* },  if S* ≠ ∅.  (4.11)

Proof. According to Theorem 4.2, we have

  V*(i) = g(i),  if i ∈ S*;
  V*(i) = c(i) ∫_0^∞ e^{−βt} (1 − Σ_{j∈S} Q(t, j|i)) dt + Σ_{j∈S} ∫_0^∞ e^{−βt} Q(dt, j|i) V*(j),  if i ∈ S \ S*.

Let U*(i) be the value function of the corresponding SMDPs, whose existence is guaranteed by Lemma 4.1. Moreover, by (4.7), it holds that V*(i) = U*(i) for each i ∈ S and U*(∆) = 0. For each i ∈ Ŝ, denote by f*(i) = 1_{S*∪{∆}}(i) a deterministic stationary policy of the SMDPs. What we need to do is to verify that f* ∈ Π_DS is an optimal policy of the SMDPs, that is, f* satisfies (4.3) of Lemma 4.1. Since A(∆) = {1}, f*(∆) = 1. For i ∈ S*, we have f*(i) =
1. Hence, the definitions of Q̂ and ĉ imply that

  ĉ(i, f*(i)) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, f*(i))) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, f*(i)) U*(j) = g(i) = V*(i) = U*(i).

For i ∈ S \ S*, we have f*(i) =
0. Again, using the definitions of Q̂ and ĉ, we have

  ĉ(i, f*(i)) ∫_0^∞ e^{−βt} (1 − Σ_{j∈Ŝ} Q̂(t, j|i, f*(i))) dt + Σ_{j∈Ŝ} ∫_0^∞ e^{−βt} Q̂(dt, j|i, f*(i)) U*(j)
  = c(i) ∫_0^∞ e^{−βt} (1 − Σ_{j∈S} Q(t, j|i)) dt + Σ_{j∈S} ∫_0^∞ e^{−βt} Q(dt, j|i) V*(j) = U*(i).

Hence, (4.3) holds and f* is an optimal deterministic stationary policy of the SMDPs. Denote by τ* := τ_{f*} the stopping time induced by the policy f*, which is an optimal stopping time of the SMP according to Theorem 4.2. If S* ≠ ∅, then, for each ω = (i_0, t_1, i_1, . . . , t_n, i_n, . . .) ∈ Ω, it holds that

  τ*(ω) = τ_{f*}(ω) = inf{ n ∈ N : f*(i_n) = 1 } = inf{ n ∈ N : i_n ∈ S* }.

If S* = ∅, then for all ω ∈ Ω, f*(i_n) = 1_{{∆}}(i_n) = 0 for all n ≥ 0, so τ*(ω) = τ_{f*}(ω) = inf{ n ∈ N : f*(i_n) = 1 } = +∞, and the proof is complete. □

The statement of Theorem 4.3 requires the value function V*(i) of the SMP, but in practical applications the value function is often unknown. Intuitively, we can replace the value function V*(i) by the approximating function V*_n(i) obtained from the iterative algorithm of Theorem 4.2. Accordingly, the notion of optimal stopping time is replaced by ε-optimality, as in the following definition.

Definition 4.4.
Given any $\varepsilon > 0$, a stopping time $\tau_\varepsilon$ of SMPs is called $\varepsilon$-optimal if it holds that
$$ V_{\tau_\varepsilon}(i) - V^*(i) \leq \varepsilon \qquad \text{for all } i \in S, $$
where $V^*(i)$ is the value function of SMPs and $V_{\tau_\varepsilon}(i)$ is the expected discounted cost given in (2.8).

The following theorem shows that, for any $\varepsilon > 0$, we can iterate enough times to obtain an $\varepsilon$-optimal stopping time under some conditions. For convenience of statement, we introduce two notations: $\| f \| := \sup_{x \in E} | f(x) |$ for any function $f$ defined on a set $E$, and $\lfloor x \rfloor := \max\{ n \in \mathbb{N} : n \leq x \}$ for any $x \in [0, +\infty)$.

Theorem 4.5.
Suppose that Assumption 2.1 holds and that $c(i)$ and $g(i)$ are bounded. For any $\varepsilon > 0$, the number of iterations $N_\varepsilon$ is given by
$$ N_\varepsilon := \left\lfloor \frac{\log\big( \varepsilon\, (\epsilon - \epsilon e^{-\beta \delta^*}) \big) - \log\big( \beta^{-1} \| c \| + \| g \| \big)}{\log\big( 1 - \epsilon + \epsilon e^{-\beta \delta^*} \big)} \right\rfloor, \qquad (4.12) $$
where $\epsilon, \delta$ are introduced in Assumption 2.1 and $\delta^* := \min\{ \delta, 1/2 \}$. Let $V^*_{N_\varepsilon}(i)$ be the $N_\varepsilon$-th step iterative function given in Theorem 4.2 and let $S_\varepsilon$ be the subset of $S$ given by
$$ S_\varepsilon := \big\{ i \in S : g(i) = T V^*_{N_\varepsilon}(i) \big\}. $$
Then the following statements hold.
(a) Denote by $\tau_\varepsilon$ the hitting time of $S_\varepsilon$, i.e.
$$ \tau_\varepsilon(\omega) = \begin{cases} +\infty, & S_\varepsilon = \emptyset; \\ \inf\{ n \in \mathbb{N} : i_n \in S_\varepsilon \}, & S_\varepsilon \neq \emptyset. \end{cases} $$
Then $\tau_\varepsilon$ is an $\varepsilon$-optimal stopping time of SMPs.
(b) If it holds that
$$ \inf_{i \in S \setminus S_\varepsilon} \big( g(i) - T V^*_{N_\varepsilon}(i) \big) > \varepsilon, \qquad (4.13) $$
then $\tau_\varepsilon$ is also the optimal stopping time of SMPs.

Proof. (a) Using Lemma 4.1, for each $i \in \hat{S}$ and $a \in A(i)$, we have
$$ \sum_{j \in \hat{S}} \int_0^{\infty} e^{-\beta t} \hat{Q}(\mathrm{d}t, j \mid i, a) = \int_0^{\delta^*} e^{-\beta t} \sum_{j \in \hat{S}} \hat{Q}(\mathrm{d}t, j \mid i, a) + \int_{\delta^*}^{\infty} e^{-\beta t} \sum_{j \in \hat{S}} \hat{Q}(\mathrm{d}t, j \mid i, a) $$
$$ \leq \big( 1 - e^{-\beta \delta^*} \big) \sum_{j \in \hat{S}} \hat{Q}(\delta^*, j \mid i, a) + e^{-\beta \delta^*} \leq 1 - \epsilon + \epsilon e^{-\beta \delta^*} < 1, \qquad (4.14) $$
where the second inequality depends on (4.4). Denote by $\gamma := 1 - \epsilon + \epsilon e^{-\beta \delta^*} < 1$. In the same way, we obtain $\sum_{j \in S} \int_0^{\infty} e^{-\beta t} Q(\mathrm{d}t, j \mid i) \leq \gamma$. Since $V^*_n(i) - V^*_{n-1}(i) \leq \| V^*_n - V^*_{n-1} \|$ for each $n \geq 0$ and $i \in S$, by the definition of $T$ in (4.5), we have
$$ T V^*_n(i) \leq \min\Big\{ g(i),\ c(i) \int_0^{\infty} e^{-\beta t} \Big( 1 - \sum_{j \in S} Q(t, j \mid i) \Big) \mathrm{d}t + \sum_{j \in S} \int_0^{\infty} e^{-\beta t} V^*_{n-1}(j)\, Q(\mathrm{d}t, j \mid i) + \| V^*_n - V^*_{n-1} \| \sum_{j \in S} \int_0^{\infty} e^{-\beta t} Q(\mathrm{d}t, j \mid i) \Big\} $$
$$ \leq \min\Big\{ g(i),\ c(i) \int_0^{\infty} e^{-\beta t} \Big( 1 - \sum_{j \in S} Q(t, j \mid i) \Big) \mathrm{d}t + \sum_{j \in S} \int_0^{\infty} e^{-\beta t} V^*_{n-1}(j)\, Q(\mathrm{d}t, j \mid i) \Big\} + \| V^*_n - V^*_{n-1} \| \sum_{j \in S} \int_0^{\infty} e^{-\beta t} Q(\mathrm{d}t, j \mid i) $$
$$ \leq T V^*_{n-1}(i) + \gamma\, \| V^*_n - V^*_{n-1} \|. $$
Let $U^*_n(i)$ be the function given in (4.1) of Lemma 4.1; then, using (4.8), we have $U^*_n(i) = V^*_n(i)$ for all $i \in S$ and $n \geq -1$. Hence,
$$ U^*_n(i) - U^*_{n-1}(i) = T V^*_{n-1}(i) - T V^*_{n-2}(i) \leq \gamma\, \| V^*_{n-1} - V^*_{n-2} \| = \gamma\, \| U^*_{n-1} - U^*_{n-2} \|, $$
which implies (since $U^*_{-1}(i) \equiv 0$) that
$$ \| U^*_n - U^*_{n-1} \| \leq \gamma^n \| U^*_0 \| \leq \gamma^n \big( \beta^{-1} \| c \| + \| g \| \big). \qquad (4.15) $$
For each $i \in \hat{S}$, we define $f_\varepsilon(i) := \mathbf{1}_{S_\varepsilon \cup \{\Delta\}}(i)$. Similar to the proof of Theorem 4.3, we obtain $f_\varepsilon \in \Pi_{DS}$. Moreover, for each $n \geq 1$, by induction, we show that
$$ U^*_{N_\varepsilon + 1}(i) \geq \sum_{m=0}^{n-1} \hat{E}^{f_\varepsilon}_i \Big[ \big( e^{-\beta \hat{S}_m} - e^{-\beta \hat{S}_{m+1}} \big) \hat{c}(\hat{X}_m, \hat{A}_m) \Big] + \hat{E}^{f_\varepsilon}_i \Big[ e^{-\beta \hat{S}_n} U^*_{N_\varepsilon + 1}(\hat{X}_n) \Big] - \sum_{m=0}^{n-1} \gamma^{m+1} \| U^*_{N_\varepsilon + 1} - U^*_{N_\varepsilon} \|. \qquad (4.16) $$
The key to (4.16) is that, by Lemma 4.1, it holds that
$$ U^*_{N_\varepsilon + 1}(i_n) \geq \hat{c}(i_n, f_\varepsilon(i_n)) \int_0^{\infty} e^{-\beta t} \Big( 1 - \sum_{i_{n+1} \in \hat{S}} \hat{Q}(t, i_{n+1} \mid i_n, f_\varepsilon(i_n)) \Big) \mathrm{d}t + \sum_{i_{n+1} \in \hat{S}} \int_0^{\infty} e^{-\beta t} \hat{Q}(\mathrm{d}t, i_{n+1} \mid i_n, f_\varepsilon(i_n))\, U^*_{N_\varepsilon + 1}(i_{n+1}) - \gamma\, \| U^*_{N_\varepsilon + 1} - U^*_{N_\varepsilon} \| $$
$$ = \hat{E}^{f_\varepsilon}_i \Big[ \big( 1 - e^{-\beta (\hat{S}_{n+1} - \hat{S}_n)} \big) \hat{c}(\hat{X}_n, \hat{A}_n) \,\Big|\, \hat{X}_n = i_n \Big] + \hat{E}^{f_\varepsilon}_i \Big[ e^{-\beta (\hat{S}_{n+1} - \hat{S}_n)} U^*_{N_\varepsilon + 1}(\hat{X}_{n+1}) \,\Big|\, \hat{X}_n = i_n \Big] - \gamma\, \| U^*_{N_\varepsilon + 1} - U^*_{N_\varepsilon} \|. $$
Hence, the second term of (4.16) satisfies
$$ \hat{E}^{f_\varepsilon}_i \Big[ e^{-\beta \hat{S}_n} U^*_{N_\varepsilon + 1}(\hat{X}_n) \Big] \geq \hat{E}^{f_\varepsilon}_i \Big[ \big( e^{-\beta \hat{S}_n} - e^{-\beta \hat{S}_{n+1}} \big) \hat{c}(\hat{X}_n, \hat{A}_n) \Big] + \hat{E}^{f_\varepsilon}_i \Big[ e^{-\beta \hat{S}_{n+1}} U^*_{N_\varepsilon + 1}(\hat{X}_{n+1}) \Big] - \gamma^{n+1} \| U^*_{N_\varepsilon + 1} - U^*_{N_\varepsilon} \|. $$
Hence, passing to the limit $n \to \infty$, it holds that
$$ U^*(i) \geq U^*_{N_\varepsilon + 1}(i) \geq U^{f_\varepsilon}(i) - \frac{\gamma}{1 - \gamma} \| U^*_{N_\varepsilon + 1} - U^*_{N_\varepsilon} \|, \qquad \forall\, i \in \hat{S}. \qquad (4.17) $$
Using (4.15) and the definition of $N_\varepsilon$, it can be verified that $\frac{\gamma}{1 - \gamma} \| U^*_{N_\varepsilon + 1} - U^*_{N_\varepsilon} \| \leq \varepsilon$. Then, for each $i \in \hat{S}$, we have $U^*(i) \geq U^{f_\varepsilon}(i) - \varepsilon$. By the same method as in Theorem 4.3, we can verify that the hitting time $\tau_\varepsilon$ of $S_\varepsilon$ satisfies $\tau_\varepsilon = \tau_{f_\varepsilon}$, where $\tau_{f_\varepsilon}$ is the stopping time induced by $f_\varepsilon$. Moreover, using Theorem 3.7 and Theorem 4.2, we have
$$ V^*(i) = U^*(i) \geq U^{f_\varepsilon}(i) - \varepsilon = E_i[R_{\tau_\varepsilon}] - \varepsilon = V_{\tau_\varepsilon}(i) - \varepsilon, $$
which means that $\tau_\varepsilon$ is an $\varepsilon$-optimal stopping time of SMPs.
(b)
If $S \setminus S_\varepsilon = \emptyset$, then $S = S_\varepsilon$ and the condition (4.13) holds trivially. Hence, for each $i \in S$, we have $g(i) = T V^*_{N_\varepsilon}(i) \leq T V^*(i) \leq g(i)$, so that $g(i) = T V^*(i)$. That means $S = S^* = S_\varepsilon$, where $S^*$ is given in (4.10).

Next, we consider the case $S \setminus S_\varepsilon \neq \emptyset$. Again, using the monotonicity of $T$, we have $S_\varepsilon \subset S^*$. Conversely, by the definition of $V^*_{N_\varepsilon}(i)$ and (4.17), it holds that
$$ T V^*_{N_\varepsilon}(i) = V^*_{N_\varepsilon + 1}(i) \geq U^{f_\varepsilon}(i) - \varepsilon \geq U^*(i) - \varepsilon = V^*(i) - \varepsilon, $$
i.e. $V^*(i) \leq T V^*_{N_\varepsilon}(i) + \varepsilon$ for each $i \in S$. For each $i \in S \setminus S_\varepsilon$, the condition (4.13) implies that
$$ g(i) - T V^*(i) = g(i) - V^*(i) \geq g(i) - T V^*_{N_\varepsilon}(i) - \varepsilon > 0, $$
which means that $i \in S \setminus S^*$. Hence, we have $S^* \subset S_\varepsilon$ and then $S^* = S_\varepsilon$. Finally, we have $\tau_\varepsilon = \tau^*$, which is the optimal stopping time given in Theorem 4.3. □

In this section, we consider a specific example of the optimal stopping time of SMPs, namely a maintenance system. We illustrate how to calculate the value function and the optimal stopping time by the iterative algorithm.
Example 5.1.
The repairable maintenance system consists of three states, say 1, 2 and 3, which represent "normal operation", "minor failure" and "serious failure", respectively. At each state $i \in \{1, 2, 3\}$, the decision maker has two choices: either to maintain the system or to stop using it. If the decision maker chooses to maintain the system, a maintenance cost with rate $c(i)$ is incurred. After a random period of time, which obeys the exponential distribution with parameter $u(i) > 0$, the system transfers to state $j \in \{1, 2, 3\}$ with probability $p_{ij}$. Otherwise, if the decision maker chooses to stop using the system, a terminal cost $g(i)$ has to be paid; the system then stops running, and no further cost is incurred.

We now model this problem as an optimal stopping time problem of SMPs. Let $S = \{1, 2, 3\}$ be the state space. The semi-Markov kernel is given by
$$ Q(t, j \mid i) = p_{ij} \big( 1 - e^{-u(i) t} \big) $$
for each $t \geq 0$ and $i, j \in S$, where $p_{ij}$ is the transition probability and $u(i)$ is the parameter of the exponential distribution. The cost rate $c(i)$ and the terminal cost $g(i)$ are determined by the data of the system.

Next, to illustrate the effectiveness of Theorem 4.2 and Theorem 4.3, we consider a numerical example with a fixed discount factor $\beta$ and given numerical values of $p_{ij}$, $u(i)$, $c(i)$ and $g(i)$. Under these data, Assumption 2.1 holds in this example with a suitable $\delta > 0$ and $\epsilon = \min\{ e^{-u(1)\delta}, e^{-u(2)\delta}, e^{-u(3)\delta} \}$; hence $\gamma = 1 - \epsilon + \epsilon e^{-\beta \delta^*}$. Then, let $V^*_{-1}(i) = 0$ and $V^*_n(i) = T V^*_{n-1}(i)$ for $n \geq 0$, where $T$ is the operator defined in (4.5), and denote $V^*(i) = \lim_{n \to \infty} V^*_n(i)$. Carrying out the iteration in Matlab until $\| V^*_n - V^*_{n-1} \| \leq \varepsilon$ for the prescribed accuracy $\varepsilon$, we obtain the numerical values of $V^*(1)$, $V^*(2)$ and $V^*(3)$.

Next, we consider the set $S_\varepsilon = \{ i \in S : g(i) = T V^*_{N_\varepsilon}(i) \}$ given in Theorem 4.5. By numerical calculation, we have
$$ g(1) - c(1) \int_0^{\infty} e^{-\beta t} \Big( 1 - \sum_{j \in S} Q(t, j \mid 1) \Big) \mathrm{d}t - \sum_{j \in S} \int_0^{\infty} e^{-\beta t} Q(\mathrm{d}t, j \mid 1)\, V^*(j) > \varepsilon, $$
$$ g(2) - c(2) \int_0^{\infty} e^{-\beta t} \Big( 1 - \sum_{j \in S} Q(t, j \mid 2) \Big) \mathrm{d}t - \sum_{j \in S} \int_0^{\infty} e^{-\beta t} Q(\mathrm{d}t, j \mid 2)\, V^*(j) > \varepsilon, $$
$$ c(3) \int_0^{\infty} e^{-\beta t} \Big( 1 - \sum_{j \in S} Q(t, j \mid 3) \Big) \mathrm{d}t + \sum_{j \in S} \int_0^{\infty} e^{-\beta t} Q(\mathrm{d}t, j \mid 3)\, V^*(j) > g(3). $$
Hence, $S^* = S_\varepsilon = \{ 3 \}$, and the optimal stopping time of this maintenance system is given by
$$ \tau^*(\omega) = \inf\{ n \in \mathbb{N} : i_n = 3 \}, \qquad \omega = (i_0, t_1, \ldots, i_n, t_{n+1}, \ldots) \in \Omega. $$
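The computation just described is easy to reproduce. The following is a minimal Python sketch (rather than the Matlab computation mentioned above) of the value iteration of Theorem 4.2 together with the hitting-time rule of Theorem 4.3 for a three-state maintenance model; all numerical values of $\beta$, $(p_{ij})$, $u$, $c$ and $g$ below are illustrative assumptions, not the data of Example 5.1. For the exponential kernel $Q(t, j \mid i) = p_{ij}(1 - e^{-u(i)t})$, the operator $T$ in (4.5) takes the closed form $(Tv)(i) = \min\{ g(i),\ c(i)/(\beta + u(i)) + u(i)/(\beta + u(i)) \sum_j p_{ij} v(j) \}$.

```python
import math

import numpy as np

# Illustrative data (assumed for this sketch, not the data of Example 5.1).
beta = 0.2                            # discount factor
p = np.array([[0.70, 0.15, 0.15],     # transition probabilities p_ij
              [0.30, 0.50, 0.20],
              [0.10, 0.20, 0.70]])
u = np.array([0.5, 0.8, 1.0])         # exponential sojourn rates u(i)
c = np.array([5.0, 20.0, 60.0])       # maintenance cost rates c(i)
g = np.array([300.0, 150.0, 40.0])    # terminal costs g(i)

def T(v):
    """Operator (4.5) for Q(t, j | i) = p_ij (1 - e^{-u(i) t}):
    (Tv)(i) = min{ g(i), c(i)/(beta+u(i)) + u(i)/(beta+u(i)) * sum_j p_ij v(j) }."""
    continuation = c / (beta + u) + u / (beta + u) * (p @ v)
    return np.minimum(g, continuation)

# Value iteration: V_{-1} = 0 and V_n = T V_{n-1}, run to a small sup-norm gap.
v = np.zeros(3)
gap = math.inf
while gap > 1e-10:
    v_next = T(v)
    gap = float(np.max(np.abs(v_next - v)))
    v = v_next

# Stopping region of Theorem 4.3: the states where g(i) = (T V*)(i) = V*(i).
s_star = {i + 1 for i in range(3) if np.isclose(v[i], g[i])}

def hitting_time(path, stopping_set):
    """First-passage rule (4.11): stop at the first epoch n with i_n in S*."""
    for n, state in enumerate(path):
        if state in stopping_set:
            return n
    return math.inf  # also covers the case of an empty stopping set: never stop

print(np.round(v, 4), s_star)                  # value function and S*
print(hitting_time([1, 2, 1, 3, 2], s_star))   # stops at n = 3, the first visit to state 3
```

With these assumed data, the iteration converges to $V^* \approx (35.31,\ 58.13,\ 40)$ and $S^* = \{3\}$, so the induced stopping time is the first visit to the serious-failure state, the same qualitative conclusion as the numerical computation above.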