Similar documents
20 similar documents found (search time: 15 ms)
1.
《Optimization》2012,61(3):431-455
The aim of this paper is to give a survey of recent developments in the area of successive approximations for Markov decision processes and Markov games. We will emphasize two aspects, viz. the conditions under which successive approximations converge in some strong sense, and variations of these methods which diminish the amount of computational work to be executed. With respect to the first aspect, it will be shown how much unboundedness of the rewards may be allowed without violating convergence.

With respect to the second aspect, we present four ideas, which can be applied in conjunction and which may diminish the amount of work to be done. These ideas are: 1. the use of the actual convergence of the iterates for the construction of upper and lower bounds (MacQueen bounds); 2. the use of alternative policy improvement procedures (based on stopping times); 3. a better evaluation of the values of actual policies in each iteration step by a value-oriented approach; 4. the elimination of suboptimal actions, not only permanently but also temporarily. The general presentation is given for Markov decision processes, with a final section devoted to the possibilities of extension to Markov games.
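As a concrete illustration of idea 1 (the MacQueen bounds), the following is a minimal sketch of discounted value iteration that tracks these upper and lower bounds for a finite MDP. The array layout (P[a, s, s'], r[s, a]) and the function name are assumptions made for this example; the survey itself treats more general settings, including unbounded rewards.

import numpy as np

def value_iteration_macqueen(P, r, beta, tol=1e-6, max_iter=10_000):
    # Discounted value iteration with MacQueen upper/lower bounds.
    # P[a, s, s']: transition probabilities, r[s, a]: rewards, beta: discount factor.
    v = np.zeros(P.shape[1])
    for _ in range(max_iter):
        q = r + beta * np.einsum('asj,j->sa', P, v)
        v_new = q.max(axis=1)
        diff = v_new - v
        lo = v_new + beta / (1.0 - beta) * diff.min()   # elementwise lower bound on v*
        hi = v_new + beta / (1.0 - beta) * diff.max()   # elementwise upper bound on v*
        v = v_new
        if (hi - lo).max() < tol:                       # stop once the bound gap is small
            break
    policy = (r + beta * np.einsum('asj,j->sa', P, v)).argmax(axis=1)
    return (lo + hi) / 2.0, policy

Stopping on the gap between the bounds typically terminates much earlier than a plain sup-norm test on successive iterates, which is the kind of computational saving the survey discusses.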

2.
Partially observable Markov decision chains with finite state, action and signal spaces are considered. The performance index is the risk-sensitive average criterion and, under conditions concerning reachability between the unobservable states and observability of the signals, it is shown that the value iteration algorithm can be implemented to approximate the optimal average cost, to determine a stationary policy whose performance index is arbitrarily close to the optimal one, and to establish the existence of solutions to the optimality equation. The results rely on an appropriate extension of the well-known Schweitzer transformation.
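For reference, the risk-sensitive average cost criterion mentioned here is usually written in the following form (the notation below is assumed for illustration, not taken from the paper): for a risk-sensitivity parameter \lambda \neq 0, cost function C, and policy \pi,
\[
J_\lambda(\pi, x) \;=\; \limsup_{n\to\infty} \frac{1}{n\lambda}\,\log \mathbb{E}^{\pi}_{x}\!\left[\exp\!\Big(\lambda \sum_{t=0}^{n-1} C(x_t, a_t)\Big)\right],
\]
which reduces to the ordinary (risk-neutral) average cost as \lambda \to 0.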

3.
4.
This paper provides a policy iteration algorithm for solving communicating Markov decision processes (MDPs) under the average reward criterion. The algorithm is based on the result that for communicating MDPs there is an optimal policy which is unichain. The improvement step is modified to select only unichain policies; consequently, the nested optimality equations of Howard's multichain policy iteration algorithm are avoided. Properties and advantages of the algorithm are discussed, and it is incorporated into a decomposition algorithm for solving multichain MDPs. Since it is easier to show that a problem is communicating than that it is unichain, we recommend using this algorithm instead of unichain policy iteration. This research has been partially supported by NSERC Grant A-5527.
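For orientation, the sketch below is plain average-reward policy iteration for the unichain case (gain/bias evaluation followed by greedy improvement); it does not include the paper's modified improvement step that restricts attention to unichain policies in a communicating MDP. Array shapes and names are assumptions.

import numpy as np

def evaluate_unichain(P, r, policy):
    # Solve r_pi + P_pi h = g*1 + h with the normalization h[0] = 0.
    n = P.shape[1]
    P_pi = np.array([P[policy[s], s] for s in range(n)])
    r_pi = np.array([r[s, policy[s]] for s in range(n)])
    A = np.zeros((n, n))
    A[:, 0] = 1.0                                   # column for the gain g
    A[:, 1:] = (np.eye(n) - P_pi)[:, 1:]            # columns for h[1], ..., h[n-1]
    sol = np.linalg.solve(A, r_pi)
    return sol[0], np.concatenate(([0.0], sol[1:]))  # gain g, bias h

def policy_iteration_avg(P, r, max_iter=100, tol=1e-10):
    # P[a, s, s'], r[s, a]; returns an average-reward optimal policy (unichain case).
    n_states = P.shape[1]
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        g, h = evaluate_unichain(P, r, policy)
        q = r + np.einsum('asj,j->sa', P, h)
        new_policy = policy.copy()
        for s in range(n_states):
            best = int(q[s].argmax())
            if q[s, best] > q[s, policy[s]] + tol:   # switch only on strict improvement
                new_policy[s] = best
        if np.array_equal(new_policy, policy):
            return policy, g
        policy = new_policy
    return policy, g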

5.
We present an algorithm which aggregates states online while learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals for aggregating the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy. These bounds are only slightly worse than the original bounds for UCRL.
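For context, the confidence intervals used in UCRL-style algorithms for the empirical transition probabilities are typically of the form (constants differ between analyses; this generic form is given only for illustration and is not claimed to be the bound used in the paper)
\[
\big\| \hat p_t(\cdot \mid s,a) - p(\cdot \mid s,a) \big\|_1 \;\le\; c\,\sqrt{\frac{S\,\log(t/\delta)}{\max\{1,\,N_t(s,a)\}}},
\]
where S is the number of states, N_t(s,a) is the number of visits to the state-action pair (s,a) up to time t, and \delta is the confidence parameter. States whose empirical statistics agree within such intervals are candidates for aggregation.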

6.
7.
This paper deals with continuous-time Markov decision programming (CTMDP for short) with an unbounded reward rate. The economic criterion is the long-run average reward. For models with a countable state space and compact metric action sets, we present a set of sufficient conditions ensuring the existence of stationary optimal policies. This paper was prepared with the support of the National Youth Science Foundation.

8.
We consider an approximation scheme for solving Markov decision processes (MDPs) with countable state space, finite action space, and bounded rewards that uses an approximate solution of a fixed finite-horizon sub-MDP of a given infinite-horizon MDP to create a stationary policy, which we call “approximate receding horizon control.” We first analyze the performance of the approximate receding horizon control for infinite-horizon average reward under an ergodicity assumption, which also generalizes the result obtained by White (J. Oper. Res. Soc. 33 (1982) 253-259). We then study two examples of the approximate receding horizon control via lower bounds to the exact solution of the sub-MDP. The first control policy is based on a finite-horizon approximation of Howard's policy improvement of a single policy, and the second policy is based on a generalization of the single-policy improvement to multiple policies. Along the way, we also provide a simple alternative proof of the policy improvement result for countable state spaces. We finally discuss practical implementations of these schemes via simulation.
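The core construction (solve a finite-horizon sub-MDP, then act greedily with respect to its value function as a stationary policy) can be sketched as follows for a finite state space; the paper's setting is countable-state with approximate sub-MDP solutions, and the names and array shapes below are assumptions.

import numpy as np

def finite_horizon_value(P, r, horizon):
    # Backward induction over the given horizon; P[a, s, s'], r[s, a].
    v = np.zeros(P.shape[1])
    for _ in range(horizon):
        v = (r + np.einsum('asj,j->sa', P, v)).max(axis=1)
    return v

def receding_horizon_policy(P, r, horizon):
    # Stationary policy: one-step greedy action with respect to the H-step value.
    v = finite_horizon_value(P, r, horizon)
    return (r + np.einsum('asj,j->sa', P, v)).argmax(axis=1)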

9.
This paper studies two-person nonzero-sum games for denumerable continuous-time Markov chains determined by transition rates, with an expected average criterion. The transition rates are allowed to be unbounded, and the payoff functions may be unbounded from above and from below. We give suitable conditions under which the existence of a Nash equilibrium is ensured. More precisely, using the so-called "vanishing discount" approach, a Nash equilibrium for the average criterion is obtained as a limit point of a sequence of equilibrium strategies for the discounted criterion as the discount factors tend to zero. Our results are illustrated with a birth-and-death game.
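The vanishing-discount idea referred to here rests, in its most common form, on the relation (notation assumed for illustration; it holds under suitable recurrence conditions)
\[
\lim_{\alpha \downarrow 0}\, \alpha\, V_\alpha(x, \pi) \;=\; J(x, \pi),
\]
where V_\alpha is the expected \alpha-discounted payoff and J is the expected average payoff; equilibria of the discounted games are then shown to accumulate, as \alpha \downarrow 0, at an equilibrium for the average criterion.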

10.
We are concerned with Markov decision processes with Borel state and action spaces; the transition law and the reward function depend on an unknown parameter. In this framework, we study the recursive adaptive nonstationary value iteration policy, which is proved to be optimal under the same conditions usually imposed to obtain the optimality of other well-known nonrecursive adaptive policies. The results are illustrated by showing the existence of optimal adaptive policies for a class of additive-noise systems with unknown noise distribution. This research was supported in part by the Consejo Nacional de Ciencia y Tecnología under Grants PCEXCNA-050156 and A128CCOEO550, and in part by the Third World Academy of Sciences under Grant TWAS RG MP 898-152.

11.
We consider discrete-time nonlinear controlled stochastic systems, modeled by controlled Markov chains with denumerable state space and compact action space. The corresponding stochastic control problem of maximizing average rewards in the long run is studied. Departing from the most common position, which uses expected values of rewards, we focus on a sample-path analysis of the stream of states/rewards. Under a Lyapunov function condition, we show that stationary policies obtained from the average reward optimality equation are not only average reward optimal, but indeed sample-path average reward optimal, for almost all sample paths. Research supported by a U.S.-México Collaborative Research Program funded by the National Science Foundation under grant NSF-INT 9201430, and by CONACyT-MEXICO. Partially supported by the MAXTOR Foundation for Applied Probability and Statistics, under grant No. 01-01-56/04-93. Research partially supported by the Engineering Foundation under grant RI-A-93-10, and by a grant from the AT&T Foundation.
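Sample-path average reward optimality, as opposed to the usual expected-average criterion, requires the long-run average along almost every realized trajectory to attain the optimal value (notation assumed for illustration):
\[
\liminf_{n\to\infty} \frac{1}{n} \sum_{t=0}^{n-1} r(x_t, a_t) \;\ge\; \rho^{*} \quad \text{a.s.},
\]
where \rho^{*} denotes the optimal expected average reward; the expected criterion only constrains the averages of \mathbb{E}\big[\sum_{t=0}^{n-1} r(x_t,a_t)\big].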

12.
13.
14.
15.
We consider discrete-time average reward Markov decision processes with denumerable state space and bounded reward function. Under structural restrictions on the model, the existence of an optimal stationary policy is proved; both the lim inf and lim sup average criteria are considered. In contrast to the usual approach, our results do not rely on the average reward optimality equation. Rather, the arguments are based on well-known facts from Renewal Theory. This research was supported in part by the Consejo Nacional de Ciencia y Tecnologia (CONACYT) under Grants PCEXCNA 040640 and 050156, and by SEMAC under Grant 89-1/00ifn$.
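The two criteria mentioned are, for a policy \pi and initial state x (standard notation, assumed here):
\[
J_{-}(\pi, x) = \liminf_{n\to\infty} \frac{1}{n}\, \mathbb{E}^{\pi}_{x}\!\left[\sum_{t=0}^{n-1} r(x_t, a_t)\right],
\qquad
J_{+}(\pi, x) = \limsup_{n\to\infty} \frac{1}{n}\, \mathbb{E}^{\pi}_{x}\!\left[\sum_{t=0}^{n-1} r(x_t, a_t)\right],
\]
the lim inf and lim sup expected average rewards, respectively.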

16.
For games with a non-empty core, the Alexia value is introduced: a value which averages the lexicographic maxima of the core. It is seen that the Alexia value coincides with the Shapley value for convex games, and with the nucleolus for strongly compromise admissible games and big boss games. For simple flow games, clan games and compromise stable games, an explicit expression and interpretation of the Alexia value is derived. Furthermore, it is shown that the reverse Alexia value, defined by averaging the lexicographic minima of the core, coincides with the Alexia value for convex games and compromise stable games.
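In the notation usually used for this value (assumed here, not quoted from the paper), for an n-player game v with core C(v),
\[
\mathrm{Alexia}(v) \;=\; \frac{1}{n!} \sum_{\sigma \in \Pi(N)} L^{\sigma}\big(C(v)\big),
\]
where \Pi(N) is the set of orderings of the players and L^{\sigma}(C(v)) is the lexicographic maximum of the core with respect to the ordering \sigma (maximize the payoff of the first player in \sigma over the core, then the second player's payoff over the resulting face, and so on).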

17.
We study nonzero-sum stopping games with randomized stopping strategies. The existence of Nash equilibrium and ɛ-equilibrium strategies is discussed under various assumptions on the players' random payoffs and utility functions, which depend on the observed discrete-time Markov process. We then present a model of a market game in which randomized stopping times are involved. The model is a mixture of a stochastic game and a stopping game. Research supported by grant PBZ-KBN-016/P03/99.

18.
In this paper we survey some recent developments in the numerical analysis of Markov operators, and in particular Frobenius–Perron operators associated with chaotic discrete dynamical systems.
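One standard workhorse in this area, given here only for illustration (the survey may cover other discretizations), is Ulam's method: partition the phase space into bins and approximate the Frobenius–Perron operator by a row-stochastic matrix of transition fractions between bins. A minimal Monte Carlo sketch, with names and parameters chosen for this example:

import numpy as np

def ulam_matrix(f, n_bins=200, n_samples=1000, seed=0):
    # Ulam-type discretization of the Frobenius-Perron operator of a map f on [0, 1]:
    # entry (i, j) estimates the fraction of bin i that f maps into bin j.
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    P = np.zeros((n_bins, n_bins))
    for i in range(n_bins):
        x = rng.uniform(edges[i], edges[i + 1], n_samples)
        j = np.clip(np.searchsorted(edges, f(x), side='right') - 1, 0, n_bins - 1)
        P[i] += np.bincount(j, minlength=n_bins)
    return P / n_samples

# Approximate the invariant density of the logistic map x -> 4x(1 - x):
P = ulam_matrix(lambda x: 4.0 * x * (1.0 - x))
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = np.abs(pi) / np.abs(pi).sum()   # stationary distribution over the bins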

19.
This note determines a priori bounds for B. L. Fox's [J. Math. Anal. Appl., 34 (1971), 665–670] scheme of approximating discounted Markov programs, thus refining bounds recently obtained by D. J. White (Notes in Decision Theory No. 43, University of Manchester, 1977). The approximation scheme focuses careful attention on only a subset of the state space and uses a fixed function to characterize future returns outside the designated subset. The a priori bounds are useful for designing the specific approximation, that is, for selecting the appropriate subset on which the approximation is based.

20.
In this article we study cooperative multi-choice games with limited cooperation possibilities, represented by an undirected forest on the player set. Players in the game can cooperate if they are connected in the forest. We introduce a new (single-valued) solution concept which is a generalization of the average tree solution defined and characterized by Herings et al. (Games Econ. Behav. 62:77–92, 2008) for TU-games played on a forest. Our solution is characterized by component efficiency, component fairness and independence on the greatest activity level. It belongs to the precore of a restricted multi-choice game whenever the underlying multi-choice game is superadditive and isotone. We also link our solution with the hierarchical outcomes (Demange in J. Polit. Econ. 112:754–778, 2004) of some particular TU-games played on trees. Finally, we propose two possible economic applications of our average tree solution.
