Similar Documents (20 results)
1.
We introduce the concept of a Markov risk measure and we use it to formulate risk-averse control problems for two Markov decision models: a finite horizon model and a discounted infinite horizon model. For both models we derive risk-averse dynamic programming equations and a value iteration method. For the infinite horizon problem we develop a risk-averse policy iteration method and we prove its convergence. We also propose a version of the Newton method to solve a nonsmooth equation arising in the policy iteration method and we prove its global convergence. Finally, we discuss relations to min–max Markov decision models.
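A minimal sketch of the kind of risk-averse value iteration described above, for a small finite MDP with the one-step conditional risk mapping instantiated as CVaR (one standard example of a coherent risk measure); the transition arrays, cost matrix, and the tail level alpha are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def cvar(values, probs, alpha):
    """CVaR_alpha of a discrete random cost (larger = worse).

    Uses the representation CVaR_alpha(Z) = min_t { t + E[(Z - t)_+] / alpha };
    for a discrete distribution the minimiser is one of the support points.
    """
    return min(t + np.maximum(values - t, 0.0) @ probs / alpha for t in values)

def risk_averse_value_iteration(P, c, gamma=0.95, alpha=0.2, tol=1e-8):
    """Risk-averse value iteration with a CVaR one-step risk mapping.

    P[a] is the |S| x |S| transition matrix of action a, c is the |S| x |A|
    one-step cost matrix; costs are minimised.
    """
    nS, nA = c.shape
    V = np.zeros(nS)
    while True:
        # replace the usual expectation over the next state by CVaR
        Q = np.array([[c[s, a] + gamma * cvar(V, P[a][s], alpha)
                       for a in range(nA)] for s in range(nS)])
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)
        V = V_new
```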

2.
In this paper, partially observable Markov decision processes (POMDPs) with discrete state and action space under the average reward criterion are considered from a recently developed sensitivity point of view. By analyzing the average-reward performance difference formula, we propose a policy iteration algorithm with step sizes to obtain an optimal or locally optimal memoryless policy. This algorithm improves the policy along the same direction as policy iteration does, and suitable step sizes guarantee the convergence of the algorithm. Moreover, the algorithm can be used in Markov decision processes (MDPs) with correlated actions. Two numerical examples are provided to illustrate the applicability of the algorithm.

3.
Policy iteration is a well-studied algorithm for solving stationary Markov decision processes (MDPs). It has also been extended to robust stationary MDPs. For robust nonstationary MDPs, however, an “as is” execution of this algorithm is not possible because it would call for an infinite amount of computation in each iteration. We therefore present a policy iteration algorithm for robust nonstationary MDPs, which performs finitely implementable approximate variants of policy evaluation and policy improvement in each iteration. We prove that the sequence of cost-to-go functions produced by this algorithm monotonically converges pointwise to the optimal cost-to-go function; the policies generated converge subsequentially to an optimal policy.
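A sketch of the robust Bellman update underlying such algorithms, in the simplest stationary setting where the uncertainty set for each action is a finite list of candidate transition kernels and nature picks the worst one; the finitely implementable nonstationary machinery of the paper is not reproduced, and all arrays are illustrative.

```python
import numpy as np

def robust_value_iteration(P_candidates, c, gamma=0.9, tol=1e-8, max_iter=100_000):
    """Robust value iteration for a stationary MDP (cost minimisation).

    P_candidates[a] is a finite list of |S| x |S| transition matrices forming
    the uncertainty set of action a; nature picks the worst kernel each step.
    """
    nS, nA = c.shape
    V = np.zeros(nS)
    for _ in range(max_iter):
        Q = np.empty((nS, nA))
        for a in range(nA):
            # worst-case expected continuation cost over the uncertainty set
            worst = np.max(np.stack([P @ V for P in P_candidates[a]]), axis=0)
            Q[:, a] = c[:, a] + gamma * worst
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmin(axis=1)
```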

4.
易文  徐渝  陈志刚 《运筹与管理》2007,16(6):133-136
The dynamic evolution of technology and competition among firms strongly influence a firm's new-product strategy and directly determine the timing of new product introductions. This paper builds a stochastic dynamic programming model in a random environment of evolving industry technology, taking technological progress, investment cost, and product market competition into account, to study a firm's choice of new product introduction cycle and to design methods for the timing and quality decisions of new product introduction, with illustrative applications. The stochastic dynamic programming model yields the optimal introduction cycle, and numerical examples analyze how technological progress and R&D cost affect the firm's introduction-cycle strategy. The model is solved by policy iteration; the results show that faster technological progress leads the firm to introduce new products more quickly, while higher R&D costs slow the pace of new product introduction.

5.
We introduce a class of models for multidimensional control problems that we call skip-free Markov decision processes on trees. We describe and analyse an algorithm applicable to Markov decision processes of this type that are skip-free in the negative direction. Starting with the finite average cost case, we show that the algorithm combines the advantages of both value iteration and policy iteration—it is guaranteed to converge to an optimal policy and optimal value function after a finite number of iterations but the computational effort required for each iteration step is comparable with that for value iteration. We show that the algorithm can also be used to solve discounted cost models and continuous-time models, and that a suitably modified algorithm can be used to solve communicating models.

6.
Infinite-Horizon Partially Observable Markov Decision Programming   (cited by 2: 0 self-citations, 2 by others)
This paper further discusses the infinite-horizon discounted-cost partially observable Markov decision programming considered in [1,2], clarifying some vague concepts and supplementing or correcting omissions and errors found there. In particular, the class of finitely transient policies is enlarged while preserving the piecewise linearity of the cost function. Finally, several new results are given, and a correction and a convergence estimate are provided for the policy iteration algorithm of [1].
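The finitely transient, piecewise-linear constructions discussed above do not compress into a few lines, but the belief-MDP recursion they refine can be approximated crudely on a grid over the belief simplex. The sketch below assumes a tiny discrete POMDP given by illustrative arrays P (transitions), Z (observation probabilities), and c (costs); nearest-grid-point lookup stands in for proper interpolation.

```python
import numpy as np
from itertools import product

def belief_update(b, a, o, P, Z):
    """Bayes update of belief b after taking action a and observing o.

    P[a][s, s'] = P(s' | s, a);  Z[a][s', o] = P(o | s', a).
    Returns the new belief and the probability of observing o.
    """
    unnorm = Z[a][:, o] * (P[a].T @ b)
    p_o = unnorm.sum()
    return (unnorm / p_o, p_o) if p_o > 0 else (b, 0.0)

def grid_value_iteration(P, Z, c, gamma=0.95, resolution=10, sweeps=200):
    """Discounted-cost value iteration on a regular grid over the belief simplex.

    Only workable for very small state spaces: the grid grows quickly with |S|.
    """
    nS, nA = c.shape
    nO = Z[0].shape[1]
    grid = np.array([np.array(k) / resolution
                     for k in product(range(resolution + 1), repeat=nS)
                     if sum(k) == resolution])
    V = np.zeros(len(grid))

    def lookup(b):
        # nearest grid point stands in for proper simplex interpolation
        return V[np.argmin(np.linalg.norm(grid - b, axis=1))]

    for _ in range(sweeps):
        V_new = np.empty_like(V)
        for i, b in enumerate(grid):
            q = []
            for a in range(nA):
                future = 0.0
                for o in range(nO):
                    b_next, p_o = belief_update(b, a, o, P, Z)
                    future += p_o * lookup(b_next)
                q.append(b @ c[:, a] + gamma * future)
            V_new[i] = min(q)
        V = V_new
    return grid, V
```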

7.
This paper provides a policy iteration algorithm for solving communicating Markov decision processes (MDPs) with average reward criterion. The algorithm is based on the result that for communicating MDPs there is an optimal policy which is unichain. The improvement step is modified to select only unichain policies; consequently the nested optimality equations of Howard's multichain policy iteration algorithm are avoided. Properties and advantages of the algorithm are discussed and it is incorporated into a decomposition algorithm for solving multichain MDPs. Since it is easier to show that a problem is communicating than unichain, we recommend use of this algorithm instead of unichain policy iteration. This research has been partially supported by NSERC Grant A-5527.
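For reference, a compact sketch of average-reward policy iteration in the unichain case that the modified improvement step builds on: the evaluation step solves the gain/bias equations with the normalisation h(s0) = 0, and ties in the improvement step keep the current action. The unichain-selection rule of the paper itself is not reproduced, and the arrays are illustrative.

```python
import numpy as np

def evaluate_unichain(P_d, r_d, s0=0):
    """Solve g*1 + (I - P_d) h = r_d with the normalisation h[s0] = 0."""
    nS = len(r_d)
    A = np.zeros((nS + 1, nS + 1))
    A[:nS, 0] = 1.0                    # column for the gain g
    A[:nS, 1:] = np.eye(nS) - P_d      # columns for the bias vector h
    A[nS, 1 + s0] = 1.0                # normalisation row: h[s0] = 0
    b = np.concatenate([r_d, [0.0]])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[0], x[1:]                 # gain, bias

def unichain_policy_iteration(P, r, s0=0, max_iter=1000):
    """P[a]: |S| x |S| transition matrices; r: |S| x |A| rewards (maximised)."""
    nS, nA = r.shape
    policy = np.zeros(nS, dtype=int)
    for _ in range(max_iter):
        P_d = np.array([P[policy[s]][s] for s in range(nS)])
        r_d = np.array([r[s, policy[s]] for s in range(nS)])
        g, h = evaluate_unichain(P_d, r_d, s0)
        Q = np.array([[r[s, a] + P[a][s] @ h for a in range(nA)]
                      for s in range(nS)])
        new_policy = policy.copy()
        for s in range(nS):            # improve, keeping the old action on ties
            best = int(Q[s].argmax())
            if Q[s, best] > Q[s, policy[s]] + 1e-10:
                new_policy[s] = best
        if np.array_equal(new_policy, policy):
            return policy, g, h
        policy = new_policy
    return policy, g, h
```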

8.
Continuous-time Markovian decision models with countable state space are investigated. The existence of an optimal stationary policy is established for the expected average return criterion function. It is shown that the expected average return can be expressed as an expected discounted return of a related Markovian decision process. A policy iteration method is given which converges to an optimal deterministic policy; the policy so obtained is shown to be optimal over all Markov policies.

9.
In this paper, we study discounted Markov decision processes on an uncountable state space. We allow a utility (reward) function to be unbounded both from above and below. A new feature in our approach is an easily verifiable rate of growth condition introduced for the positive part of the utility function. This assumption, in turn, enables us to prove the convergence of a value iteration algorithm to a solution to the Bellman equation. Moreover, by virtue of the optimality equation we show the existence of an optimal stationary policy.

10.
We start with a discussion of coupled algebraic Riccati equations arising in the study of linear-quadratic optimal control problems for Markov jump linear systems. Under suitable assumptions, this system of equations has a unique positive semidefinite solution, which is the solution of practical interest. The coupled equations can be rewritten as a single linearly perturbed matrix Riccati equation with special structures. We study the linearly perturbed Riccati equation in a more general setting and obtain a class of iterative methods from different splittings of a positive operator involved in the Riccati equation. We prove some special properties of the sequences generated by these methods and determine and compare the convergence rates of these methods. Our results are then applied to the coupled Riccati equations of jump linear systems. We obtain linear convergence of the Lyapunov iteration and the modified Lyapunov iteration, and confirm that the modified Lyapunov iteration indeed has faster convergence than the original Lyapunov iteration.
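A hedged sketch of a Lyapunov-type fixed-point iteration for the continuous-time coupled Riccati equations of a Markov jump linear system: at each step the feedback gain is frozen at the current iterate and one standard Lyapunov equation is solved per mode. The mode matrices, the generator Lam, the initial guess, and the assumption that the usual stabilisability/detectability conditions hold are all illustrative; this is not necessarily the exact splitting analysed in the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lyapunov_iteration(A, B, Q, R, Lam, n_iter=200, tol=1e-10):
    """Fixed-point iteration for the coupled Riccati equations
        A_i' X_i + X_i A_i + Q_i - X_i B_i R_i^{-1} B_i' X_i + sum_j Lam[i, j] X_j = 0
    of a continuous-time Markov jump linear system with generator Lam.
    A, B, Q, R are lists of per-mode matrices.
    """
    N, n = len(A), A[0].shape[0]
    X = [Q[i].copy() for i in range(N)]            # symmetric initial guess
    for _ in range(n_iter):
        X_new = []
        for i in range(N):
            K = np.linalg.solve(R[i], B[i].T @ X[i])   # gain frozen at X[i]
            A_cl = A[i] - B[i] @ K
            C = Q[i] + K.T @ R[i] @ K + sum(Lam[i, j] * X[j] for j in range(N))
            # next iterate solves the Lyapunov equation A_cl' Y + Y A_cl + C = 0
            X_new.append(solve_continuous_lyapunov(A_cl.T, -C))
        if max(np.linalg.norm(Xn - Xo) for Xn, Xo in zip(X_new, X)) < tol:
            return X_new
        X = X_new
    return X
```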

11.
In this paper, we are concerned with a new algorithm for multichain finite state Markov decision processes which finds an average optimal policy through the decomposition of the state space into some communicating classes and a transient class. For each communicating class, a relatively optimal policy is found, which is used to find an optimal policy by applying the value iteration algorithm. Using a pattern matrix determining the behaviour pattern of the decision process, the decomposition of the state space is effectively done, so that the proposed algorithm simplifies the structured algorithm given in Leizarowitz's excellent paper (Math Oper Res 28:553–586, 2003). Also, a numerical example is given to illustrate the algorithm.
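The decomposition step can be illustrated with a generic strongly-connected-components computation on the union of the transition supports over all actions; this is only a sketch of the communicating-class structure and does not reproduce the paper's pattern-matrix construction or its finer classification of transient states.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def communicating_classes(P):
    """Group the states of an MDP into communicating classes.

    P[a] is the |S| x |S| transition matrix of action a.  Two states are put
    in the same class when each can reach the other under some choice of
    actions, i.e. when they lie in the same strongly connected component of
    the union digraph with an arc s -> s' whenever some action moves s to s'
    with positive probability.  Classes that can be left under some action
    contain candidate transient states.
    """
    union = (sum(P) > 0).astype(int)
    n_comp, label = connected_components(csr_matrix(union),
                                         directed=True, connection='strong')
    classes = [np.flatnonzero(label == c) for c in range(n_comp)]
    closed, leavable = [], []
    for states in classes:
        rest = np.setdiff1d(np.arange(union.shape[0]), states)
        can_leave = union[np.ix_(states, rest)].any()
        (leavable if can_leave else closed).append(states.tolist())
    return closed, leavable
```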

12.
Performance optimization is considered for average-cost multichain Markov decision processes (MDPs) with compact action set. Since, for a general compact multichain model, the optimality equation system may have no solution, and also a policy iteration algorithm may yield a suboptimal policy rather than an optimal one, we concentrate only on a special case of multichain models in this paper, where we assume that the classification of states is fixed rather than varying with policies. By using the concept of performance potentials, the existence of solutions to the optimality equation system is established, and then a potential-based policy iteration algorithm is proposed to solve this system. In addition, the convergence of the algorithm to optimality on recurrent classes is proved. Finally, a numerical example is provided.

13.
The sandwich algorithm (SA) is an alternative to the data augmentation (DA) algorithm that uses an extra simulation step at each iteration. In this paper, we show that the sandwich algorithm always converges at least as fast as the DA algorithm, in the Markov operator norm sense. We also establish conditions under which the spectrum of SA dominates that of DA. An example illustrates the results.
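A generic skeleton of the two samplers being compared: the DA chain alternates the two conditional draws, while the sandwich chain inserts one inexpensive extra move on the latent data in between. The three conditional samplers are hypothetical callables standing in for a concrete model.

```python
import numpy as np

rng = np.random.default_rng(0)

def da_step(x, draw_y_given_x, draw_x_given_y):
    """One data-augmentation (DA) update: x -> y -> x'."""
    y = draw_y_given_x(x, rng)
    return draw_x_given_y(y, rng)

def sandwich_step(x, draw_y_given_x, extra_move, draw_x_given_y):
    """One sandwich update: x -> y -> y' -> x'.  extra_move is an inexpensive
    transition that leaves the law of the latent data invariant (for example,
    a draw over a low-dimensional group acting on y)."""
    y = draw_y_given_x(x, rng)
    y = extra_move(y, rng)
    return draw_x_given_y(y, rng)

def run_chain(x0, step, n_iter, **kernels):
    """Run either sampler; the conditional samplers are passed by keyword."""
    xs = [x0]
    for _ in range(n_iter):
        xs.append(step(xs[-1], **kernels))
    return xs
```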

14.
As is well-known, the convergence of the policy iteration algorithm in multichain Markov renewal programming with no discounting depends on the choice of the relative value vectors during the iteration. We present a choice rule which also guarantees the convergence and is weaker than other known rules. Moreover, the computational complexity of the policy iteration algorithm is smaller if this rule is used.

15.
This paper investigates a regularized Newton iteration for solving nonlinear ill-posed operator equations. The convergence of the iteration under general conditions, together with some of its other properties, is discussed. Combined with a discrepancy principle for determining the number of iterations, the iteration is locally convergent.
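In finite dimensions, a regularised Newton iteration of this type reduces to a Levenberg-Marquardt-style update stopped by the discrepancy principle. In the sketch below, F, dF, the data y, the noise level delta, and the regularisation schedule are illustrative assumptions, not the operator-theoretic setting of the paper.

```python
import numpy as np

def regularized_newton(F, dF, y, x0, delta, tau=1.5,
                       alpha0=1.0, q=0.5, max_iter=100):
    """Regularised Newton iteration for the (possibly ill-posed) equation F(x) = y.

    F:  callable forward map, R^n -> R^m
    dF: callable returning the Jacobian F'(x) as an m x n array
    The iteration stops by the discrepancy principle ||F(x_k) - y|| <= tau * delta,
    where delta is the assumed noise level in the data y.
    """
    x = np.asarray(x0, dtype=float)
    alpha = alpha0
    for _ in range(max_iter):
        residual = y - F(x)
        if np.linalg.norm(residual) <= tau * delta:
            break                                   # discrepancy principle reached
        J = dF(x)
        # Tikhonov-regularised Newton step: (J'J + alpha I) step = J' residual
        step = np.linalg.solve(J.T @ J + alpha * np.eye(x.size), J.T @ residual)
        x = x + step
        alpha *= q                                  # geometrically decaying regularisation
    return x
```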

16.
This paper studies the policy iteration algorithm (PIA) for average cost Markov control processes on Borel spaces. Two classes of MCPs are considered. One of them allows some restricted-growth unbounded cost functions and compact control constraint sets; the other one requires strictly unbounded costs and the control constraint sets may be non-compact. For each of these classes, the PIA yields, under suitable assumptions, the optimal (minimum) cost, an optimal stationary control policy, and a solution to the average cost optimality equation.

17.
The following optimality principle is established for finite undiscounted or discounted Markov decision processes: If a policy is (gain, bias, or discounted) optimal in one state, it is also optimal for all states reachable from this state using this policy. The optimality principle is used constructively to demonstrate the existence of a policy that is optimal in every state, and then to derive the coupled functional equations satisfied by the optimal return vectors. This reverses the usual sequence, where one first establishes (via policy iteration or linear programming) the solvability of the coupled functional equations, and then shows that the solution is indeed the optimal return vector and that the maximizing policy for the functional equations is optimal for every state.

18.
This paper deals with the bias optimality of multichain models for finite continuous-time Markov decision processes. Based on new performance difference formulas developed here, we prove the convergence of a so-called bias-optimal policy iteration algorithm, which can be used to obtain bias-optimal policies in a finite number of iterations.

19.
In this paper, we consider a mean–variance optimization problem for Markov decision processes (MDPs) over the set of (deterministic stationary) policies. Different from the usual formulation in MDPs, we aim to obtain the mean–variance optimal policy that minimizes the variance over the set of all policies with a given expected reward. For continuous-time MDPs with the discounted criterion and finite-state and action spaces, we prove that the mean–variance optimization problem can be transformed to an equivalent discounted optimization problem using the conditional expectation and Markov properties. Then, we show that a mean–variance optimal policy and the efficient frontier can be obtained by policy iteration methods with a finite number of iterations. We also address related issues such as a mutual fund theorem and illustrate our results with an example.

20.
The functional equations of infinite horizon discounted Markov renewal programming are v* = Tv*, where T is a monotone contraction operator. This paper shows how to accelerate convergence of the value-iteration scheme v^(n+1) = Tv^(n) by a block-scaling step whereby all states in a given group have their v_i^(n) scaled by a common scale factor. A similar method exists when the relative values v*_i - v*_1 are computed iteratively. In both cases, the block scaling factors are the solution to a set of functional equations which has a structure similar to a discounted Markov renewal program, and can itself be solved by successive approximation, policy iteration, or linear programming.
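For context, plain successive approximation v^(n+1) = Tv^(n) for a discounted Markov renewal program, which is the iteration the block-scaling step accelerates; the scaling factors solve their own fixed-point system and are not computed here. The per-transition discount factors beta are illustrative.

```python
import numpy as np

def successive_approximation(p, beta, r, tol=1e-10, max_iter=100_000):
    """Compute v* = T v* by value iteration, where
        (T v)_i = max_a [ r[i, a] + sum_j beta[a][i, j] * p[a][i, j] * v_j ].

    p[a]    : |S| x |S| transition probabilities of action a
    beta[a] : |S| x |S| expected discount factors over the random sojourn times
    r       : |S| x |A| one-step rewards
    T is a monotone contraction when all effective discount factors are < 1.
    """
    nS, nA = r.shape
    v = np.zeros(nS)
    for _ in range(max_iter):
        Q = np.stack([r[:, a] + (beta[a] * p[a]) @ v for a in range(nA)], axis=1)
        v_new = Q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, Q.argmax(axis=1)
        v = v_new
    return v, Q.argmax(axis=1)
```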
