Similar documents
20 similar documents found (search time: 15 ms)
1.
《Optimization》2012,61(3):431-455
The aim of this paper is to give a survey of recent developments in the area of successive approximations for Markov decision processes and Markov games. We will emphasize two aspects, viz. the conditions under which successive approximations converge in some strong sense, and variations of these methods which diminish the amount of computational work to be executed. With respect to the first aspect, it will be shown how much unboundedness of the rewards may be allowed without violating convergence.

With respect to the second aspect, we present four ideas, which can be applied in conjunction and which may diminish the amount of work to be done. These ideas are: 1. the use of the actual convergence of the iterates for the construction of upper and lower bounds (MacQueen bounds); 2. the use of alternative policy improvement procedures (based on stopping times); 3. a better evaluation of the values of actual policies in each iteration step by a value-oriented approach; 4. the elimination of suboptimal actions, not only permanently but also temporarily. The general presentation is given for Markov decision processes, with a final section devoted to the possibilities of extension to Markov games.
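As a concrete illustration of idea 1 (the MacQueen bounds), the following is a minimal sketch of discounted value iteration that tracks these upper and lower bounds for a finite MDP. The array layout (P[a, s, s'], r[s, a]) and the function name are assumptions made for this example; the survey itself treats more general settings, including unbounded rewards.

import numpy as np

def value_iteration_macqueen(P, r, beta, tol=1e-6, max_iter=10_000):
    # Discounted value iteration with MacQueen upper/lower bounds.
    # P[a, s, s']: transition probabilities, r[s, a]: rewards, beta: discount factor.
    v = np.zeros(P.shape[1])
    for _ in range(max_iter):
        q = r + beta * np.einsum('asj,j->sa', P, v)
        v_new = q.max(axis=1)
        diff = v_new - v
        lo = v_new + beta / (1.0 - beta) * diff.min()   # elementwise lower bound on v*
        hi = v_new + beta / (1.0 - beta) * diff.max()   # elementwise upper bound on v*
        v = v_new
        if (hi - lo).max() < tol:                       # stop once the bound gap is small
            break
    policy = (r + beta * np.einsum('asj,j->sa', P, v)).argmax(axis=1)
    return (lo + hi) / 2.0, policy

Stopping on the gap between the bounds typically terminates much earlier than a plain sup-norm test on successive iterates, which is the kind of computational saving the survey discusses.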

2.
Partially observable Markov decision chains with finite state, action and signal spaces are considered. The performance index is the risk-sensitive average criterion and, under conditions concerning reachability between the unobservable states and observability of the signals, it is shown that the value iteration algorithm can be implemented to approximate the optimal average cost, to determine a stationary policy whose performance index is arbitrarily close to the optimal one, and to establish the existence of solutions to the optimality equation. The results rely on an appropriate extension of the well-known Schweitzer transformation.
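For reference, the risk-sensitive average cost criterion mentioned here is usually written in the following form (the notation below is assumed for illustration, not taken from the paper): for a risk-sensitivity parameter \lambda \neq 0, cost function C, and policy \pi,
\[
J_\lambda(\pi, x) \;=\; \limsup_{n\to\infty} \frac{1}{n\lambda}\,\log \mathbb{E}^{\pi}_{x}\!\left[\exp\!\Big(\lambda \sum_{t=0}^{n-1} C(x_t, a_t)\Big)\right],
\]
which reduces to the ordinary (risk-neutral) average cost as \lambda \to 0.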

3.
4.
This paper provides a policy iteration algorithm for solving communicating Markov decision processes (MDPs) under the average reward criterion. The algorithm is based on the result that for communicating MDPs there is an optimal policy which is unichain. The improvement step is modified to select only unichain policies; consequently, the nested optimality equations of Howard's multichain policy iteration algorithm are avoided. Properties and advantages of the algorithm are discussed, and it is incorporated into a decomposition algorithm for solving multichain MDPs. Since it is easier to show that a problem is communicating than that it is unichain, we recommend using this algorithm instead of unichain policy iteration. This research has been partially supported by NSERC Grant A-5527.
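For orientation, the sketch below is plain average-reward policy iteration for the unichain case (gain/bias evaluation followed by greedy improvement); it does not include the paper's modified improvement step that restricts attention to unichain policies in a communicating MDP. Array shapes and names are assumptions.

import numpy as np

def evaluate_unichain(P, r, policy):
    # Solve r_pi + P_pi h = g*1 + h with the normalization h[0] = 0.
    n = P.shape[1]
    P_pi = np.array([P[policy[s], s] for s in range(n)])
    r_pi = np.array([r[s, policy[s]] for s in range(n)])
    A = np.zeros((n, n))
    A[:, 0] = 1.0                                   # column for the gain g
    A[:, 1:] = (np.eye(n) - P_pi)[:, 1:]            # columns for h[1], ..., h[n-1]
    sol = np.linalg.solve(A, r_pi)
    return sol[0], np.concatenate(([0.0], sol[1:]))  # gain g, bias h

def policy_iteration_avg(P, r, max_iter=100, tol=1e-10):
    # P[a, s, s'], r[s, a]; returns an average-reward optimal policy (unichain case).
    n_states = P.shape[1]
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        g, h = evaluate_unichain(P, r, policy)
        q = r + np.einsum('asj,j->sa', P, h)
        new_policy = policy.copy()
        for s in range(n_states):
            best = int(q[s].argmax())
            if q[s, best] > q[s, policy[s]] + tol:   # switch only on strict improvement
                new_policy[s] = best
        if np.array_equal(new_policy, policy):
            return policy, g
        policy = new_policy
    return policy, g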

5.
We present an algorithm which aggregates states online while learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals for aggregating the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy. These bounds are only slightly worse than the original bounds for UCRL.
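For context, the confidence intervals used in UCRL-style algorithms for the empirical transition probabilities are typically of the form (constants differ between analyses; this generic form is given only for illustration and is not claimed to be the bound used in the paper)
\[
\big\| \hat p_t(\cdot \mid s,a) - p(\cdot \mid s,a) \big\|_1 \;\le\; c\,\sqrt{\frac{S\,\log(t/\delta)}{\max\{1,\,N_t(s,a)\}}},
\]
where S is the number of states, N_t(s,a) is the number of visits to the state-action pair (s,a) up to time t, and \delta is the confidence parameter. States whose empirical statistics agree within such intervals are candidates for aggregation.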

6.
7.
This paper deals with continuous-time Markov decision programming (CTMDP for short) with an unbounded reward rate. The economic criterion is the long-run average reward. For models with a countable state space and compact metric action sets, we present a set of sufficient conditions ensuring the existence of stationary optimal policies. This paper was prepared with the support of the National Youth Science Foundation.

8.
We consider an approximation scheme for solving Markov decision processes (MDPs) with countable state space, finite action space, and bounded rewards that uses an approximate solution of a fixed finite-horizon sub-MDP of a given infinite-horizon MDP to create a stationary policy, which we call “approximate receding horizon control.” We first analyze the performance of the approximate receding horizon control for infinite-horizon average reward under an ergodicity assumption, which also generalizes the result obtained by White (J. Oper. Res. Soc. 33 (1982) 253-259). We then study two examples of the approximate receding horizon control via lower bounds to the exact solution of the sub-MDP. The first control policy is based on a finite-horizon approximation of Howard's policy improvement of a single policy, and the second policy is based on a generalization of the single-policy improvement to multiple policies. Along the way, we also provide a simple alternative proof of the policy improvement result for countable state spaces. We finally discuss practical implementations of these schemes via simulation.
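The core construction (solve a finite-horizon sub-MDP, then act greedily with respect to its value function as a stationary policy) can be sketched as follows for a finite state space; the paper's setting is countable-state with approximate sub-MDP solutions, and the names and array shapes below are assumptions.

import numpy as np

def finite_horizon_value(P, r, horizon):
    # Backward induction over the given horizon; P[a, s, s'], r[s, a].
    v = np.zeros(P.shape[1])
    for _ in range(horizon):
        v = (r + np.einsum('asj,j->sa', P, v)).max(axis=1)
    return v

def receding_horizon_policy(P, r, horizon):
    # Stationary policy: one-step greedy action with respect to the H-step value.
    v = finite_horizon_value(P, r, horizon)
    return (r + np.einsum('asj,j->sa', P, v)).argmax(axis=1)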

9.
This paper studies two-person nonzero-sum games for denumerable continuous-time Markov chains determined by transition rates, with an expected average criterion. The transition rates are allowed to be unbounded, and the payoff functions may be unbounded from above and from below. We give suitable conditions under which the existence of a Nash equilibrium is ensured. More precisely, using the so-called "vanishing discount" approach, a Nash equilibrium for the average criterion is obtained as a limit point of a sequence of equilibrium strategies for the discounted criterion as the discount factors tend to zero. Our results are illustrated with a birth-and-death game.
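The vanishing-discount idea referred to here rests, in its most common form, on the relation (notation assumed for illustration; it holds under suitable recurrence conditions)
\[
\lim_{\alpha \downarrow 0}\, \alpha\, V_\alpha(x, \pi) \;=\; J(x, \pi),
\]
where V_\alpha is the expected \alpha-discounted payoff and J is the expected average payoff; equilibria of the discounted games are then shown to accumulate, as \alpha \downarrow 0, at an equilibrium for the average criterion.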

10.
We are concerned with Markov decision processes with Borel state and action spaces; the transition law and the reward function depend on an unknown parameter. In this framework, we study the recursive adaptive nonstationary value iteration policy, which is proved to be optimal under the same conditions usually imposed to obtain the optimality of other well-known nonrecursive adaptive policies. The results are illustrated by showing the existence of optimal adaptive policies for a class of additive-noise systems with unknown noise distribution. This research was supported in part by the Consejo Nacional de Ciencia y Tecnología under Grants PCEXCNA-050156 and A128CCOEO550, and in part by the Third World Academy of Sciences under Grant TWAS RG MP 898-152.

11.
We consider discrete-time nonlinear controlled stochastic systems, modeled by controlled Markov chains with denumerable state space and compact action space. The corresponding stochastic control problem of maximizing average rewards in the long run is studied. Departing from the most common position, which uses expected values of rewards, we focus on a sample-path analysis of the stream of states/rewards. Under a Lyapunov function condition, we show that stationary policies obtained from the average reward optimality equation are not only average reward optimal, but indeed sample-path average reward optimal, for almost all sample paths. Research supported by a U.S.-México Collaborative Research Program funded by the National Science Foundation under grant NSF-INT 9201430, and by CONACyT-MEXICO. Partially supported by the MAXTOR Foundation for Applied Probability and Statistics, under grant No. 01-01-56/04-93. Research partially supported by the Engineering Foundation under grant RI-A-93-10, and by a grant from the AT&T Foundation.
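Sample-path average reward optimality, as opposed to the usual expected-average criterion, requires the long-run average along almost every realized trajectory to attain the optimal value (notation assumed for illustration):
\[
\liminf_{n\to\infty} \frac{1}{n} \sum_{t=0}^{n-1} r(x_t, a_t) \;\ge\; \rho^{*} \quad \text{a.s.},
\]
where \rho^{*} denotes the optimal expected average reward; the expected criterion only constrains the averages of \mathbb{E}\big[\sum_{t=0}^{n-1} r(x_t,a_t)\big].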

12.
13.
14.
15.
We consider discrete-time average reward Markov decision processes with denumerable state space and bounded reward function. Under structural restrictions on the model, the existence of an optimal stationary policy is proved; both the lim inf and lim sup average criteria are considered. In contrast to the usual approach, our results do not rely on the average reward optimality equation. Rather, the arguments are based on well-known facts from Renewal Theory. This research was supported in part by the Consejo Nacional de Ciencia y Tecnologia (CONACYT) under Grants PCEXCNA 040640 and 050156, and by SEMAC under Grant 89-1/00ifn$.
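The two criteria mentioned are, for a policy \pi and initial state x (standard notation, assumed here):
\[
J_{-}(\pi, x) = \liminf_{n\to\infty} \frac{1}{n}\, \mathbb{E}^{\pi}_{x}\!\left[\sum_{t=0}^{n-1} r(x_t, a_t)\right],
\qquad
J_{+}(\pi, x) = \limsup_{n\to\infty} \frac{1}{n}\, \mathbb{E}^{\pi}_{x}\!\left[\sum_{t=0}^{n-1} r(x_t, a_t)\right],
\]
the lim inf and lim sup expected average rewards, respectively.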

16.
For games with a non-empty core, the Alexia value is introduced: a value which averages the lexicographic maxima of the core. It is seen that the Alexia value coincides with the Shapley value for convex games, and with the nucleolus for strongly compromise admissible games and big boss games. For simple flow games, clan games and compromise stable games, an explicit expression and interpretation of the Alexia value is derived. Furthermore, it is shown that the reverse Alexia value, defined by averaging the lexicographic minima of the core, coincides with the Alexia value for convex games and compromise stable games.
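In the notation usually used for this value (assumed here, not quoted from the paper), for an n-player game v with core C(v),
\[
\mathrm{Alexia}(v) \;=\; \frac{1}{n!} \sum_{\sigma \in \Pi(N)} L^{\sigma}\big(C(v)\big),
\]
where \Pi(N) is the set of orderings of the players and L^{\sigma}(C(v)) is the lexicographic maximum of the core with respect to the ordering \sigma (maximize the payoff of the first player in \sigma over the core, then the second player's payoff over the resulting face, and so on).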

17.
We study nonzero-sum stopping games with randomized stopping strategies. The existence of Nash equilibrium and ɛ-equilibrium strategies is discussed under various assumptions on the players' random payoffs and utility functions, which depend on the observed discrete-time Markov process. We then present a model of a market game in which randomized stopping times are involved. The model is a mixture of a stochastic game and a stopping game. Research supported by grant PBZ-KBN-016/P03/99.

18.
In this paper we survey some recent developments in the numerical analysis of Markov operators, and in particular Frobenius–Perron operators associated with chaotic discrete dynamical systems.
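One standard workhorse in this area, given here only for illustration (the survey may cover other discretizations), is Ulam's method: partition the phase space into bins and approximate the Frobenius–Perron operator by a row-stochastic matrix of transition fractions between bins. A minimal Monte Carlo sketch, with names and parameters chosen for this example:

import numpy as np

def ulam_matrix(f, n_bins=200, n_samples=1000, seed=0):
    # Ulam-type discretization of the Frobenius-Perron operator of a map f on [0, 1]:
    # entry (i, j) estimates the fraction of bin i that f maps into bin j.
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    P = np.zeros((n_bins, n_bins))
    for i in range(n_bins):
        x = rng.uniform(edges[i], edges[i + 1], n_samples)
        j = np.clip(np.searchsorted(edges, f(x), side='right') - 1, 0, n_bins - 1)
        P[i] += np.bincount(j, minlength=n_bins)
    return P / n_samples

# Approximate the invariant density of the logistic map x -> 4x(1 - x):
P = ulam_matrix(lambda x: 4.0 * x * (1.0 - x))
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = np.abs(pi) / np.abs(pi).sum()   # stationary distribution over the bins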

19.
This note determines a priori bounds for B. L. Fox's [J. Math. Anal. Appl., 34 (1971), 665–670] scheme of approximating discounted Markov programs, thus refining bounds recently obtained by D. J. White (Notes in Decision Theory No. 43, University of Manchester, 1977). The approximation scheme focuses careful attention on only a subset of the state space and uses a fixed function to characterize future returns outside the designated subset. The a priori bounds are useful for designing the specific approximation, that is, for selecting the appropriate subset on which the approximation is based.

20.
In this article we study cooperative multi-choice games with limited cooperation possibilities, represented by an undirected forest on the player set. Players in the game can cooperate if they are connected in the forest. We introduce a new (single-valued) solution concept which is a generalization of the average tree solution defined and characterized by Herings et al. (Games Econ. Behav. 62:77–92, 2008) for TU-games played on a forest. Our solution is characterized by component efficiency, component fairness and independence on the greatest activity level. It belongs to the precore of a restricted multi-choice game whenever the underlying multi-choice game is superadditive and isotone. We also link our solution with the hierarchical outcomes (Demange in J. Polit. Econ. 112:754–778, 2004) of some particular TU-games played on trees. Finally, we propose two possible economic applications of our average tree solution.
