20 similar documents found.
1.
Optimization, 2012, 61(7): 1593–1623
This paper deals with the ratio and time expected average criteria for constrained semi-Markov decision processes (SMDPs). The state and action spaces are Polish spaces, the rewards and costs are unbounded from above and from below, and the mean holding times are allowed to be unbounded from above. First, under general conditions we prove the existence of constrained-optimal policies for the ratio expected average criterion by developing a technique of occupation measures, including the mean holding times, for SMDPs, which generalizes the occupation measures for standard discrete-time and continuous-time MDPs. Then, we give suitable conditions under which we establish the equivalence of the two average criteria via the optional sampling theorem, and thus we show the existence of constrained-optimal policies for the time expected average criterion. Finally, we illustrate the application of our main results with a controlled linear system, for which an exact optimal policy is obtained.
2.
Yonghui Huang, Xianping Guo. European Journal of Operational Research, 2011, 212(1): 131–140
This paper investigates finite-horizon semi-Markov decision processes with denumerable states. The optimization is over the class of all randomized history-dependent policies, which may depend on both the states and the planning horizon, and the cost rate function is assumed to be bounded below. Under suitable conditions, we show that the value function is the minimum nonnegative solution to the optimality equation and that there exists an optimal policy. Moreover, we develop an effective algorithm for computing optimal policies, derive some properties of optimal policies, and illustrate our main results with a maintenance system.
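The computational core of finite-horizon models of this kind is backward induction over the decision epochs. The following minimal sketch is only an illustration of that idea, not the paper's algorithm: it assumes a fixed number of decision epochs (the paper's horizon is in continuous time) and exponential holding times, and all model data (`P`, `cost`, `tau`, `alpha`) are invented placeholders.

```python
import numpy as np

# Backward induction for a finite-horizon SMDP with a fixed number of
# decision epochs. With exponential holding times of mean tau, the
# expected discount factor over one sojourn is 1/(1 + alpha*tau).
n_states, n_actions, n_epochs = 4, 2, 10
alpha = 0.1                                   # continuous-time discount rate
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s,a,s']
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))          # expected cost per epoch
tau = rng.uniform(0.5, 2.0, size=(n_states, n_actions))           # mean holding times
beta = 1.0 / (1.0 + alpha * tau)              # one-sojourn discount factors

V = np.zeros(n_states)                        # terminal values
policy = np.zeros((n_epochs, n_states), dtype=int)
for t in reversed(range(n_epochs)):
    Q = cost + beta * (P @ V)                 # Q[s,a] at epoch t
    policy[t] = Q.argmin(axis=1)
    V = Q.min(axis=1)

print("epoch-0 values:", np.round(V, 3))
print("epoch-0 actions:", policy[0])
```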
3.
Eugene A. Feinberg. Mathematical Methods of Operations Research, 1994, 39(3): 257–288
This paper deals with constrained average-reward semi-Markov decision processes (SMDPs) with finite state and action sets. We consider two average-reward criteria. The first is time-average rewards, which equal the lower limits of the expected average rewards per unit time as the horizon tends to infinity. The second is ratio-average rewards, which equal the lower limits of the ratios of the expected total rewards during the first n steps to the expected total duration of these n steps as n → ∞. For both criteria, we prove the existence of optimal mixed stationary policies for constrained problems when the constraints are of the same nature as the objective functions. For unichain problems, we show the existence of randomized stationary policies which are optimal for both criteria. However, optimal mixed stationary policies may differ between the two criteria even for unichain problems. We provide linear programming algorithms for the computation of optimal policies.
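The occupation-measure linear programs used in this line of work can be made concrete. Below is a minimal sketch, with invented data, of the standard construction for the constrained ratio-average criterion (not Feinberg's exact formulation): variables `x[s,a]` satisfy flow balance, the total expected holding time is normalized to one (which makes the ratio objective linear), and one inequality bounds the ratio-average cost by an arbitrary budget `kappa`.

```python
import numpy as np
from scipy.optimize import linprog

S, A = 3, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s,a,s']
r = rng.uniform(0, 1, (S, A))                  # rewards per decision epoch
c = rng.uniform(0, 1, (S, A))                  # costs per decision epoch
tau = rng.uniform(0.5, 2.0, (S, A))            # mean holding times
kappa = 0.8                                    # ratio-average cost budget

nvar = S * A
# Flow balance: sum_a x[s',a] - sum_{s,a} P[s,a,s'] x[s,a] = 0 for each s',
# plus the normalization sum_{s,a} tau[s,a] x[s,a] = 1.
A_eq = np.zeros((S + 1, nvar))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = (s == sp) - P[s, a, sp]
A_eq[S] = tau.ravel()
b_eq = np.zeros(S + 1); b_eq[S] = 1.0

# Ratio-average cost constraint: sum c x <= kappa * sum tau x = kappa.
res = linprog(-r.ravel(), A_ub=c.ravel()[None, :], b_ub=[kappa],
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("ratio-average reward:", -res.fun)
print("occupation measure:\n", np.round(res.x.reshape(S, A), 3))
```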
4.
Yonghui Huang. Journal of Mathematical Analysis and Applications, 2009, 359(1): 404–420
This paper studies the risk minimization problem in semi-Markov decision processes with denumerable states. The criterion to be optimized is the risk probability (or risk function) that a first passage time to some target set does not exceed a threshold value. We first characterize such risk functions and the corresponding optimal value function, and prove that the optimal value function satisfies the optimality equation by using a successive approximation technique. Then, we present some properties of optimal policies, and further give conditions for the existence of optimal policies. In addition, a value iteration algorithm and a policy improvement method are developed for obtaining the optimal value function and optimal policies, respectively. Finally, two examples are given to illustrate the value iteration procedure and the essential characterization of the risk function.
5.
In this paper, we study constrained continuous-time Markov decision processes with a denumerable state space and unbounded reward/cost and transition rates. The criterion to be maximized is the expected average reward, and a constraint is imposed on an expected average cost. We give suitable conditions that ensure the existence of a constrained-optimal policy. Moreover, we show that the constrained-optimal policy randomizes between two stationary policies differing in at most one state. Finally, we use a controlled queueing system to illustrate our conditions.
Supported by NSFC, NCET and RFDP.
6.
This paper presents a basic formula for performance gradient estimation of semi-Markov decision processes (SMDPs) under the average-reward criterion. This formula follows directly from a sensitivity equation in perturbation analysis. With this formula, we develop three sample-path-based gradient estimation algorithms that use a single sample path. These algorithms naturally extend many gradient estimation algorithms for discrete-time Markov systems to continuous-time semi-Markov models. In particular, they require less storage than the algorithms in the literature.
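As a rough illustration of the single-sample-path idea, here is a regenerative likelihood-ratio estimator of the average-reward gradient for a tiny SMDP with a parameterized randomized policy. It is one simple instance of sample-path gradient estimation, not the paper's perturbation-analysis formula; the two-state model, the sigmoid policy, and all constants are invented.

```python
import math
import numpy as np

# Average reward per unit time is E[cycle reward]/E[cycle time] over
# regenerative cycles (returns to state 0). Its gradient in theta is
# (E[R*Sc]*E[T] - E[R]*E[T*Sc]) / E[T]^2, where Sc is the sum of
# log-likelihood scores of the actions taken in a cycle.
rng = np.random.default_rng(7)
S, A = 2, 2
P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s,a,s']
r = rng.uniform(0, 1, (S, A))                     # reward per epoch
tau = rng.uniform(0.5, 2.0, (S, A))               # mean holding times
theta = 0.3                                       # scalar policy parameter

def pi1(s, theta):
    """Probability of choosing action 1 in state s (sigmoid in theta)."""
    return 1.0 / (1.0 + math.exp(-(theta + 0.5 * s)))

state, cycles = 0, []
R = T = Sc = 0.0                                  # running cycle statistics
for _ in range(200_000):
    p1 = pi1(state, theta)
    a = 1 if rng.random() < p1 else 0
    Sc += (1 - p1) if a == 1 else -p1             # d/dtheta log pi(a|s)
    R += r[state, a]
    T += tau[state, a]
    state = rng.choice(S, p=P[state, a])
    if state == 0:                                # regeneration point
        cycles.append((R, T, Sc))
        R = T = Sc = 0.0

R, T, Sc = map(np.array, zip(*cycles))
grad = ((R * Sc).mean() * T.mean() - R.mean() * (T * Sc).mean()) / T.mean() ** 2
print(f"average reward ≈ {R.mean()/T.mean():.4f}, gradient estimate ≈ {grad:.4f}")
```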
7.
Quanxin Zhu. Mathematical Methods of Operations Research, 2007, 66(2): 299–313
In this paper, we study the average optimality for continuous-time controlled jump Markov processes in general state and action spaces. The criterion to be minimized is the average expected cost. Both the transition rates and the cost rates are allowed to be unbounded. We propose another set of conditions under which we first establish an average optimality inequality by using the well-known “vanishing discounting factor approach”. Then, when the cost (or reward) rates are nonnegative (or nonpositive), from the average optimality inequality we prove the existence of an average optimal stationary policy in all randomized history-dependent policies by using the Dynkin formula and the Tauberian theorem. Finally, when the cost (or reward) rates have neither upper nor lower bounds, we also prove the existence of an average optimal policy in all (deterministic) stationary policies by constructing a “new” cost (or reward) rate.
Research partially supported by the Natural Science Foundation of China (Grant No. 10626021) and the Natural Science Foundation of Guangdong Province (Grant No. 06300957).
8.
Optimization, 2012, 61(5): 651–670
Optimality problems in infinite-horizon, discrete-time, vector-criterion Markov and semi-Markov decision processes are expressed as standard problems of multiobjective linear programming. Processes with discounting, absorbing processes, and completely ergodic processes without discounting are investigated. The common properties and special structure of the derived multiobjective linear programming problems are reviewed. Computational simplifications these problems admit in comparison with general multiobjective linear programming problems are discussed. Methods for solving these problems are surveyed and simple numerical examples are given.
9.
This paper considers a first passage model for discounted semi-Markov decision processes with denumerable states and nonnegative costs. The criterion to be optimized is the expected discounted cost incurred during a first passage time to a given target set. We first construct a semi-Markov decision process under a given semi-Markov decision kernel and a policy. Then, we prove that the value function satisfies the optimality equation and that there exists an optimal (or ε-optimal) stationary policy under suitable conditions by using a minimum nonnegative solution approach. Further, we give some properties of optimal policies. In addition, a value iteration algorithm for computing the value function and optimal policies is developed, and an example is given. Finally, it is shown that our model is an extension of the first passage models for both discrete-time and continuous-time Markov decision processes.
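The minimum-nonnegative-solution approach pairs naturally with value iteration started from zero, so the iterates increase to the minimal solution. Here is a minimal sketch with invented data, assuming exponential holding times (so the expected one-sojourn discount factor is 1/(1 + ατ)); costs stop accruing once the target set is reached.

```python
import numpy as np

# Value iteration for the expected discounted cost up to first passage
# into a target set, iterating upward from V = 0 so the limit is the
# minimum nonnegative solution of the optimality equation.
S, A = 5, 2
target = {4}                                   # costs stop at this set
alpha = 0.2                                    # discount rate
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s,a,s']
c = rng.uniform(0, 1, (S, A))                  # one-epoch costs
tau = rng.uniform(0.5, 2.0, (S, A))            # mean holding times
beta = 1.0 / (1.0 + alpha * tau)               # one-sojourn discount factors

mask = np.array([s not in target for s in range(S)], dtype=float)

V = np.zeros(S)
for _ in range(500):
    Q = c + beta * (P @ (V * mask))            # future cost is 0 on the target
    V_new = np.where(mask > 0, Q.min(axis=1), 0.0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print("first-passage discounted costs:", np.round(V, 4))
```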
10.
Q. X. Zhu. Mathematical Methods of Operations Research, 2007, 65(3): 519–538
This paper studies both the average sample-path reward (ASPR) criterion and the limiting average variance criterion for denumerable discrete-time Markov decision processes. The rewards may have neither upper nor lower bounds. We give sufficient conditions on the system's primitive data under which we prove the existence of ASPR-optimal stationary policies and variance-optimal policies. Our conditions are weaker than those in the previous literature. Moreover, our results are illustrated by a controlled queueing system.
Research partially supported by the Natural Science Foundation of Guangdong Province (Grant No. 06025063) and the Natural Science Foundation of China (Grant No. 10626021).
11.
Quan-xin Zhu. Acta Mathematicae Applicatae Sinica (English Series), 2011, 27(4): 613–624
In this paper we study the average sample-path cost (ASPC) problem for continuous-time Markov decision processes in Polish spaces. To the best of our knowledge, this paper is a first attempt to study the ASPC criterion for continuous-time MDPs with Polish state and action spaces. The corresponding transition rates are allowed to be unbounded, and the cost rates may have neither upper nor lower bounds. Under some mild hypotheses, we prove the existence of ε (ε ≥ 0)-ASPC optimal stationary policies based on two differe…
12.
This paper establishes a rather complete optimality theory for the average-cost semi-Markov decision model with a denumerable state space, compact metric action sets, and unbounded one-step costs for the case where the underlying Markov chains have a single ergodic set. Under a condition which, roughly speaking, requires the existence of a finite set such that the supremum over all stationary policies of the expected time and the total expected absolute cost incurred until the first return to this set are finite for any starting state, we verify the existence of a finite solution to the average-cost optimality equation and the existence of an average-cost optimal stationary policy.
13.
Solving a semi-Markov decision process (SMDP) using value or policy iteration requires precise knowledge of the probabilistic model and suffers from the curse of dimensionality. To overcome these limitations, we present a reinforcement learning approach in which one optimizes the SMDP performance criterion with respect to a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it using stochastic approximation. We apply our algorithm to call admission control. Our proposed policy gradient SMDP algorithm and its application to admission control are novel.
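To convey the flavor of such an online algorithm (this is a toy sketch, not the paper's method), the following code tunes a sigmoid admission probability for a single-server loss system by stochastic approximation, using the log-likelihood gradient of each accept/reject decision against a running average-reward baseline. All constants and the event model are invented.

```python
import math
import random

# Toy online policy gradient for admission control: arrivals are
# accepted with probability sigmoid(theta) while capacity remains;
# accepted calls pay a reward but inflate a holding cost.
random.seed(3)
CAPACITY, REWARD, HOLD_COST = 5, 1.0, 0.3
theta, avg_reward, step = 0.0, 0.0, 0.01

def accept_prob(theta):
    return 1.0 / (1.0 + math.exp(-theta))

occupancy = 0
for t in range(200_000):
    if random.random() < 0.5:                 # arrival event
        if occupancy < CAPACITY:
            p = accept_prob(theta)
            accept = random.random() < p
            reward = REWARD if accept else 0.0
            occupancy += int(accept)
            # Score function of the Bernoulli decision actually taken.
            score = (1 - p) if accept else -p
            td = reward - HOLD_COST * occupancy - avg_reward
            theta += step * td * score        # stochastic-approximation update
            avg_reward += 0.001 * td          # running average-reward estimate
    elif occupancy > 0 and random.random() < 0.3:
        occupancy -= 1                        # departure event

print(f"theta={theta:.3f}, accept prob={accept_prob(theta):.3f}, "
      f"avg reward estimate={avg_reward:.3f}")
```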
14.
Masahiko Sakaguchi. Applied Mathematics and Computation, 2010, 216(10): 2947–2958
We consider an undiscounted semi-Markov decision process with a target set, and our main concern is the problem of minimizing a threshold probability. We formulate the problem as an infinite-horizon case with a recurrent class. We show that the optimal value function is the unique solution to an optimality equation and that there exists a stationary optimal policy. Several value iteration methods and a policy improvement method are also given for our model. Furthermore, we investigate the relationship between threshold probabilities and expectations of total rewards.
15.
Mean, variance, and probabilistic criteria in finite Markov decision processes: A review
D. J. White. Journal of Optimization Theory and Applications, 1988, 56(1): 1–29
This paper is a survey of papers which make use of nonstandard Markov decision process criteria (i.e., those which do not seek simply to optimize expected returns per unit time or expected discounted return). It covers infinite-horizon nondiscounted formulations, infinite-horizon discounted formulations, and finite-horizon formulations. For problem formulations in terms solely of the probabilities of being in each state and taking each action, policy equivalence results are given which allow policies to be restricted to the class of Markov policies or to the randomizations of deterministic Markov policies. For problems which cannot be stated in such terms, in terms of the primitive state set I, formulations involving a redefinition of the states are examined. The author would like to thank two referees for a very thorough and helpful refereeing of the original article and for the extra references (Refs. 47–52) now added to the original reference list.
16.
Hyeong Soo Chang. Journal of Mathematical Analysis and Applications, 2003, 286(2): 636–651
We consider an approximation scheme for solving Markov decision processes (MDPs) with countable state space, finite action space, and bounded rewards that uses an approximate solution of a fixed finite-horizon sub-MDP of a given infinite-horizon MDP to create a stationary policy, which we call “approximate receding horizon control.” We first analyze the performance of the approximate receding horizon control for infinite-horizon average reward under an ergodicity assumption, which also generalizes the result obtained by White (J. Oper. Res. Soc. 33 (1982) 253–259). We then study two examples of the approximate receding horizon control via lower bounds to the exact solution to the sub-MDP. The first control policy is based on a finite-horizon approximation of Howard's policy improvement of a single policy, and the second policy is based on a generalization of the single-policy improvement for multiple policies. Along the way, we also provide a simple alternative proof of the policy improvement for countable state space. We finally discuss practical implementations of these schemes via simulation.
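A minimal sketch of the receding-horizon idea, with invented data: solve an H-step dynamic program for the sub-MDP, take the first-stage greedy action as a stationary policy, and evaluate that policy's long-run average reward through the stationary distribution of its chain. This illustrates the construction only, not the paper's bounds or its approximate variants.

```python
import numpy as np

# Receding-horizon control: an H-step lookahead defines a stationary policy.
S, A, H = 6, 3, 12
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s,a,s']
r = rng.uniform(0, 1, (S, A))

V = np.zeros(S)                               # terminal values of the sub-MDP
for t in range(H):
    Q = r + P @ V                             # undiscounted H-step lookahead
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)                     # first-stage greedy actions

# Long-run average reward of the induced stationary policy, via the
# stationary distribution of its Markov chain.
P_pi = P[np.arange(S), policy]                # (S, S) transition matrix
r_pi = r[np.arange(S), policy]
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu = np.abs(mu) / np.abs(mu).sum()
print("average reward of receding-horizon policy:", float(mu @ r_pi))
```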
17.
Douglas John White. Mathematical Methods of Operations Research, 1996, 43(3): 353–372
In this paper we consider a homotopy deformation approach to solving Markov decision process problems by the continuous deformation of a simpler Markov decision process problem until it is identical with the original problem. Algorithms and performance bounds are given.
18.
Operations Research Letters, 2022, 50(2): 218–223
We consider a finite state-action discounted constrained Markov decision process with uncertain running costs and known transition probabilities. We propose equivalent linear programming, second-order cone programming, and semidefinite programming problems for robust constrained Markov decision processes when the uncertain running-cost vectors belong to polytopic, ellipsoidal, and semidefinite-cone uncertainty sets, respectively. As an application, we study a variant of a machine replacement problem and perform numerical experiments on randomly generated instances of various sizes.
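For the polytopic case, robustness over the convex hull of finitely many cost vectors reduces to an epigraph variable with one linear constraint per vertex, so the robust problem remains an LP over discounted occupation measures. A minimal sketch with invented data follows (the SOCP and SDP counterparts for the other uncertainty sets are not shown); the constraint budget `b` is arbitrary and chosen loose.

```python
import numpy as np
from scipy.optimize import linprog

# Robust constrained discounted MDP with polytopic running-cost
# uncertainty: minimize the worst case over K cost-vector vertices.
S, A, K, gamma = 4, 2, 3, 0.9
rng = np.random.default_rng(5)
P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s,a,s']
C = rng.uniform(0, 1, (K, S * A))                 # cost-vector vertices
d = rng.uniform(0, 1, S * A)                      # known constraint costs
b = 6.0                                           # constraint budget
init = np.full(S, 1.0 / S)                        # initial distribution

nvar = S * A + 1                                  # occupation measure x, then t
# Discounted occupation-measure identities:
# sum_a x[s',a] - gamma * sum_{s,a} P[s,a,s'] x[s,a] = init[s'].
A_eq = np.zeros((S, nvar))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = (s == sp) - gamma * P[s, a, sp]
b_eq = init

# Epigraph: C_k^T x - t <= 0 for each vertex k; plus d^T x <= b.
A_ub = np.zeros((K + 1, nvar))
A_ub[:K, :-1] = C
A_ub[:K, -1] = -1.0
A_ub[K, :-1] = d
b_ub = np.concatenate([np.zeros(K), [b]])

obj = np.zeros(nvar); obj[-1] = 1.0               # minimize worst-case cost t
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (S * A) + [(None, None)])
print("robust discounted cost:", res.fun)
```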
19.
This paper deals with a new optimality criterion consisting of the usual three average criteria and the canonical triplet (together called the strong average-canonical optimality criterion) and introduces the concept of a strong average-canonical policy for nonstationary Markov decision processes, which extends the canonical policies of Hernández-Lerma and Lasserre [16, p. 77] for stationary Markov controlled processes. For the case of possibly non-uniformly bounded rewards and a denumerable state space, we first construct, under some conditions, a solution to the optimality equations (OEs), and then prove that the Markov policies obtained from the OEs are not only optimal for the three average criteria but also optimal for all finite-horizon criteria with a sequence of additional functions as their terminal rewards (i.e., strong average-canonical optimal). Some properties of optimal policies and the convergence of optimal average values are also discussed. Moreover, the error bound in average reward between a rolling-horizon policy and a strong average-canonical optimal policy is provided, and a rolling-horizon algorithm for computing strong average ε (ε > 0)-optimal Markov policies is given.
20.
M. Sun. Journal of Optimization Theory and Applications, 1993, 79(2): 405–413
We introduce a revised simplex algorithm for solving a typical type of dynamic programming equation arising from a class of finite Markov decision processes. The algorithm also applies to several types of optimal control problems with diffusion models after discretization. It is based on the regular simplex algorithm, the duality concept in linear programming, and certain special features of the dynamic programming equation itself. Convergence is established for the new algorithm. The algorithm has favorable potential applicability when the number of actions is very large or even infinite.
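The dynamic-programming equation being solved can itself be written as a linear program, which is the structure such simplex-based methods exploit. The sketch below, with invented data, uses an off-the-shelf simplex-based solver in place of the paper's specialized revised simplex: minimize Σ V(s) subject to V(s) ≥ r(s,a) + γ Σ P(s'|s,a) V(s') for every state-action pair, which yields the optimal value function and a greedy policy.

```python
import numpy as np
from scipy.optimize import linprog

# LP formulation of the discounted dynamic-programming equation:
# one inequality row per (s,a): (gamma*P[s,a] - e_s) V <= -r[s,a].
S, A, gamma = 5, 3, 0.95
rng = np.random.default_rng(6)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s,a,s']
r = rng.uniform(0, 1, (S, A))

A_ub = np.zeros((S * A, S))
b_ub = np.zeros(S * A)
for s in range(S):
    for a in range(A):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub[s * A + a] = row
        b_ub[s * A + a] = -r[s, a]

res = linprog(np.ones(S), A_ub=A_ub, b_ub=b_ub, bounds=(None, None),
              method="highs")
V = res.x
policy = (r + gamma * (P @ V)).argmax(axis=1)  # greedy policy from V*
print("optimal values:", np.round(V, 3), "policy:", policy)
```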