Similar literature (20 results)
1.
We study a unichain Markov decision process, i.e. a controlled Markov process whose state process under any stationary policy is an ergodic Markov chain. Here the state and action spaces are assumed to be either finite or countable. When the state process is uniformly ergodic and the immediate cost is bounded, a policy that minimizes the long-term expected average cost also has an nth-stage sample-path cost that, with probability one, is asymptotically less than the nth-stage sample-path cost under any other non-optimal stationary policy with a larger expected average cost. This strengthens, for the Markov model, the a.s. asymptotic optimality property frequently discussed in the literature.
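A minimal simulation sketch (not from the paper) of the sample-path property described above: for ergodic chains with bounded costs, the n-stage sample-path average under the policy with the smaller expected average cost eventually falls below that of the other policy almost surely. The chain, costs, and both policies below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state chain under two stationary policies, given directly as
# transition matrices and per-state costs; policy A has the smaller expected
# average cost (about 1.23 versus 1.96 for policy B).
P = {"A": np.array([[0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6],
                    [0.5, 0.3, 0.2]]),
     "B": np.array([[0.6, 0.3, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6]])}
cost = {"A": np.array([1.0, 2.0, 0.5]),
        "B": np.array([1.5, 2.5, 1.0])}

def sample_path_average(policy, n_steps=100_000, start=0):
    """Run one sample path and return the n-stage sample-path average cost."""
    x, total = start, 0.0
    for _ in range(n_steps):
        total += cost[policy][x]
        x = rng.choice(3, p=P[policy][x])
    return total / n_steps

# With probability one the sample-path averages separate in favour of the
# policy with the smaller expected average cost.
print("policy A:", sample_path_average("A"))
print("policy B:", sample_path_average("B"))
```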

2.
The following optimality principle is established for finite undiscounted or discounted Markov decision processes: If a policy is (gain, bias, or discounted) optimal in one state, it is also optimal for all states reachable from this state using this policy. The optimality principle is used constructively to demonstrate the existence of a policy that is optimal in every state, and then to derive the coupled functional equations satisfied by the optimal return vectors. This reverses the usual sequence, where one first establishes (via policy iteration or linear programming) the solvability of the coupled functional equations, and then shows that the solution is indeed the optimal return vector and that the maximizing policy for the functional equations is optimal for every state.

3.
We computationally assess policies for the elevator control problem by a new column-generation approach for the linear programming method for discounted infinite-horizon Markov decision problems. By analyzing the optimality of given actions in given states, we were able to provably improve the well-known nearest-neighbor policy. Moreover, with the method we could identify an optimal parking policy. This approach can be used to detect and resolve weaknesses in particular policies for Markov decision problems.
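The column-generation method mentioned above works on the linear-programming formulation of a discounted MDP. Below is a minimal sketch of that underlying LP, solved directly with scipy for a small hypothetical MDP; the elevator model and the column-generation machinery themselves are not reproduced.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
# Hypothetical MDP: P[a, s, s'] transition probabilities, r[s, a] rewards.
P = np.array([[[0.7, 0.3], [0.4, 0.6]],      # action 0
              [[0.2, 0.8], [0.9, 0.1]]])     # action 1
r = np.array([[1.0, 0.0],                    # state 0: reward of a0, a1
              [2.0, 0.5]])                   # state 1
n_states, n_actions = r.shape

# LP: minimize sum_s v_s  subject to  v_s >= r(s,a) + gamma * P(.|s,a) @ v
# for every state-action pair; the minimizer is the optimal value function.
rows, rhs = [], []
for s in range(n_states):
    for a in range(n_actions):
        rows.append(gamma * P[a, s] - np.eye(n_states)[s])
        rhs.append(-r[s, a])
res = linprog(c=np.ones(n_states), A_ub=np.array(rows), b_ub=np.array(rhs),
              bounds=[(None, None)] * n_states)
v = res.x

# Greedy (optimal) policy recovered from the LP solution.
policy = np.argmax(r + gamma * np.einsum("asj,j->sa", P, v), axis=1)
print("optimal values:", v, "optimal policy:", policy)
```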

4.
In this paper the possibility of using aggregation in the action space is investigated for some Markov decision processes of inventory control type. For the standard (s, S) inventory control model the policy improvement procedure can be executed very efficiently; therefore, aggregation in the action space is not of much use. However, in situations where the decisions have some aftereffect and, hence, the old decision has to be incorporated in the state, it might be rewarding to aggregate actions. Some variants of aggregation and disaggregation are formulated and analyzed. Numerical evidence is presented.

5.
In this paper, partially observable Markov decision processes (POMDPs) with discrete state and action spaces under the average reward criterion are considered from a recently developed sensitivity point of view. By analyzing the average-reward performance difference formula, we propose a policy iteration algorithm with step sizes to obtain an optimal or locally optimal memoryless policy. This algorithm improves the policy along the same direction as policy iteration does, and suitable step sizes guarantee the convergence of the algorithm. Moreover, the algorithm can be used in Markov decision processes (MDPs) with correlated actions. Two numerical examples are provided to illustrate the applicability of the algorithm.

6.
We give mild conditions for the existence of optimal solutions for a Markov decision problem with average cost, under m constraints of the same kind, on Borel state and action spaces. Moreover, there is an optimal policy that is a convex combination of at most m+1 deterministic policies.
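A minimal sketch of the finite-state analogue of this constrained problem (the paper works on Borel spaces): the average-cost MDP with one side constraint is written as a linear program over stationary occupation measures, and the induced stationary policy randomizes only where the occupation measure puts mass on several actions, consistent with the mixture-of-few-deterministic-policies result. All data below are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical average-cost MDP with one (m = 1) side constraint.
P = np.array([[[0.8, 0.2, 0.0], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]],
              [[0.1, 0.6, 0.3], [0.6, 0.2, 0.2], [0.0, 0.4, 0.6]]])   # P[a, s, s']
c = np.array([[1.0, 3.0], [2.0, 0.5], [4.0, 2.0]])                    # cost to minimise
d = np.array([[0.0, 2.0], [1.0, 3.0], [0.0, 1.0]])                    # constrained cost
theta = 1.2                                                            # constraint level
S, A = c.shape
idx = lambda s, a: s * A + a

# LP over stationary occupation measures x(s, a) >= 0:
#   balance:   sum_a x(j, a) = sum_{s,a} x(s, a) P(j | s, a)  for all j,
#   total mass 1, and the side constraint  sum x(s, a) d(s, a) <= theta.
A_eq, b_eq = [], []
for j in range(S - 1):                       # one balance row is redundant; drop it
    row = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row[idx(s, a)] -= P[a, s, j]
    for a in range(A):
        row[idx(j, a)] += 1.0
    A_eq.append(row)
    b_eq.append(0.0)
A_eq.append(np.ones(S * A))
b_eq.append(1.0)

res = linprog(c=c.ravel(), A_ub=[d.ravel()], b_ub=[theta],
              A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
x = res.x.reshape(S, A)

# Induced optimal stationary (possibly randomized) policy; the theory above says
# it can be taken as a mixture of at most m + 1 = 2 deterministic policies.
policy = x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)
print("optimal constrained average cost:", res.fun)
print("randomized stationary policy:\n", policy)
```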

7.
The paper deals with continuous-time Markov decision processes on a fairly general state space. The economic criterion is the long-run average return. A set of conditions is shown to be sufficient for a constant g to be the optimal average return and for a stationary policy π1 to be optimal. These conditions are shown to be satisfied under appropriate assumptions on the optimal discounted return function. A policy improvement algorithm is proposed and its convergence to an optimal policy is proved.

8.
This paper is a survey of papers which make use of nonstandard Markov decision process criteria (i.e., those which do not seek simply to optimize expected returns per unit time or expected discounted return). It covers infinite-horizon nondiscounted formulations, infinite-horizon discounted formulations, and finite-horizon formulations. For problem formulations in terms solely of the probabilities of being in each state and taking each action, policy equivalence results are given which allow policies to be restricted to the class of Markov policies or to the randomizations of deterministic Markov policies. For problems which cannot be stated in such terms, but only in terms of the primitive state set I, formulations involving a redefinition of the states are examined. The author would like to thank two referees for a very thorough and helpful refereeing of the original article and for the extra references (Refs. 47–52) now added to the original reference list.

9.
The functional equations of Markovian decision processes yield the state values (and the gain rate in the undiscounted case). Variational expressions are exhibited here for these state values (and gain rate); these expressions are stationary when evaluated at the correct values. When guesses for the values (and gain rate) are inserted into these variational expressions, a superior guess is usually obtained. Repetition of this procedure is shown to be equivalent to the method of successive approximations in policy space. Two other unusual features of this procedure are the following: when the linear equations determining the Lagrange multipliers are non-singular, the variational expressions for the state values amount to precisely one Newton-Raphson iteration; when applied to a linear objective function with piecewise-linear constraints, which arise for the functional equations of Markovian decision processes, the variational test quantity is piecewise constant, i.e., its first variation and all higher variations vanish. The latter explains its good performance (one-step convergence) if good estimates are available.
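A small numerical sketch (hypothetical data, discounted case) of the Newton-Raphson connection mentioned above: one step of successive approximation in policy space, i.e. exact evaluation of the greedy policy, coincides with a Newton step on the Bellman residual linearized at the greedy policy.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action discounted MDP.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])   # P[a, s, s']
r = np.array([[1.0, 0.5], [0.0, 2.0]])     # r[s, a]
v = np.zeros(2)                            # current guess

# Greedy policy at the current guess.
q = r + gamma * np.einsum("asj,j->sa", P, v)
pi = np.argmax(q, axis=1)
P_pi = P[pi, np.arange(2)]                 # transition matrix under pi
r_pi = r[np.arange(2), pi]

# (1) Newton-Raphson step on the Bellman residual F(v) = T v - v,
#     using the greedy policy's linearization gamma*P_pi - I.
F = q[np.arange(2), pi] - v
v_newton = v - np.linalg.solve(gamma * P_pi - np.eye(2), F)

# (2) One step of successive approximation in policy space
#     (exact evaluation of the greedy policy).
v_polit = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

print(np.allclose(v_newton, v_polit))      # True: the two steps coincide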

10.
Optimization, 2012, 61(1): 191–202
This paper presents a recurrence condition on Markov decision processes with a countable state space and bounded rewards. The condition is sufficient for the existence of a Blackwell optimal stationary policy having a Laurent series expansion with continuous coefficients. The condition is weak enough that the Markov chain corresponding to a stationary policy may have countably many periodic recurrent classes. Our method finds the deviation matrix in an explicit form.
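For a finite ergodic chain, the deviation matrix referred to above can be written explicitly as D = (I − P + P*)⁻¹ − P*, where P* is the limiting (stationary) projector; the small hypothetical example below computes it and checks two defining identities. The paper's countable-state construction is, of course, more general.

```python
import numpy as np

# Hypothetical ergodic (irreducible, aperiodic) transition matrix of a policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
n = P.shape[0]

# Stationary distribution pi: solve pi P = pi with sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
pi = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]
P_star = np.tile(pi, (n, 1))               # limiting projector (all rows = pi)

# Deviation matrix D = (I - P + P*)^{-1} - P*  (finite, aperiodic case).
D = np.linalg.inv(np.eye(n) - P + P_star) - P_star

# Sanity checks: D P* = 0 and (I - P) D = I - P*.
print(np.allclose(D @ P_star, 0),
      np.allclose((np.eye(n) - P) @ D, np.eye(n) - P_star))
```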

11.
1. Introduction and Model. The earlier literature about constrained Markov decision processes (MDPs, for short) can be found in Derman's book [1]. Later, there have been many achievements in this area. For example, average-reward MDPs with a constraint have been discussed by Beutler and Ross [2], Hordijk and Kallenberg [3], Altman and Schwartz [4], et al. In the case of a finite state space, discounted-reward MDPs with a constraint have been treated by Kallenberg [5] and Tanaka [6], et al. When the state space is denumerable, such problems were discussed by Sennott [7] and Alt…

12.
Directed hypergraphs are a general modelling and algorithmic tool which has been successfully used in many different research areas such as artificial intelligence, database systems, fuzzy systems, propositional logic and transportation networks. However, modelling Markov decision processes using directed hypergraphs has not yet been considered. In this paper we consider finite-horizon Markov decision processes (MDPs) with finite state and action spaces and present an algorithm for finding the K best deterministic Markov policies. That is, we are interested in ranking the first K deterministic Markov policies in non-decreasing order using an additive criterion of optimality. The algorithm uses a directed hypergraph to model the finite-horizon MDP. It is shown that the problem of finding the optimal policy can be formulated as a minimum-weight hyperpath problem and solved in linear time, with respect to the input data representing the MDP, under different additive optimality criteria.
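A minimal backward-induction sketch for a finite-horizon MDP, i.e. the K = 1 case of the problem above (the single best deterministic Markov policy corresponds to the minimum-weight hyperpath in the paper's hypergraph model). The data are hypothetical and the ranking of the K best policies is not reproduced.

```python
import numpy as np

T = 4                                       # horizon
# Hypothetical stage-invariant data: P[a, s, s'] and stage costs c[s, a].
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.3, 0.7], [0.1, 0.9]]])
c = np.array([[2.0, 1.0], [0.5, 3.0]])
n_states = c.shape[0]

V = np.zeros(n_states)                      # terminal cost
policy = []
for t in reversed(range(T)):
    # Q[s, a] = immediate cost + expected cost-to-go.
    Q = c + np.einsum("asj,j->sa", P, V)
    policy.append(np.argmin(Q, axis=1))     # best action per state at stage t
    V = Q.min(axis=1)
policy.reverse()

print("optimal cost-to-go at stage 0:", V)
print("deterministic Markov policy, stage -> action per state:", policy)
```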

13.
This paper discusses a class of non-homogeneous partially observable Markov decision models. Without changing the countability of the state space, the model is transformed into the generalized discounted model of [5], which solves its optimal-policy problem; a finite-horizon approximation algorithm for the model is also obtained, in which the states involved remain countable.

14.
This paper studies a condition-based maintenance policy for a repairable system subject to continuous-state gradual deterioration monitored by sequential non-periodic inspections. The system can be maintained using different maintenance operations (partial repair, as-good-as-new replacement) with different effects on the system state, costs and durations. A parametric decision framework (multi-threshold policy) is proposed to choose sequentially the best maintenance actions and to schedule the future inspections, using the on-line monitoring information on the system deterioration level gained from the current inspection. Taking advantage of the semi-regenerative (or Markov renewal) properties of the maintained system state, we construct a stochastic model of the time behaviour of the maintained system at steady state. This stochastic model allows the evaluation of several performance criteria for the maintenance policy, such as the long-run system availability and the long-run expected maintenance cost. Numerical experiments illustrate the behaviour of the proposed condition-based maintenance policy. Copyright © 2003 John Wiley & Sons, Ltd.
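A simplified simulation sketch in the spirit of the multi-threshold policy described above: gamma-process deterioration, non-periodic inspections scheduled from the observed level, partial repair below a replacement threshold, and the long-run cost rate estimated by simulation. All numerical values, the inspection-scheduling rule, and the repair effect are hypothetical and much cruder than the paper's semi-regenerative model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters.
SHAPE_PER_UNIT_TIME, SCALE = 0.5, 1.0       # gamma-process deterioration
L_REPAIR, L_REPLACE = 4.0, 8.0              # maintenance thresholds
C_INSP, C_REPAIR, C_REPLACE = 1.0, 5.0, 25.0

def next_inspection(x):
    """Sequential (non-periodic) scheduling: inspect sooner when more degraded."""
    return max(0.5, 4.0 - 0.4 * x)

def simulate(horizon=100_000.0):
    t, x, cost = 0.0, 0.0, 0.0
    while t < horizon:
        dt = next_inspection(x)
        x += rng.gamma(SHAPE_PER_UNIT_TIME * dt, SCALE)   # deterioration over dt
        t += dt
        cost += C_INSP
        if x >= L_REPLACE:                  # as-good-as-new replacement
            cost += C_REPLACE
            x = 0.0
        elif x >= L_REPAIR:                 # partial (imperfect) repair
            cost += C_REPAIR
            x *= 0.5
    return cost / t                          # long-run maintenance cost rate

print("estimated long-run maintenance cost rate:", simulate())
```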

15.
In this paper, we are concerned with a new algorithm for multichain finite-state Markov decision processes which finds an average-optimal policy through the decomposition of the state space into some communicating classes and a transient class. For each communicating class, a relatively optimal policy is found, which is then used to find an optimal policy by applying the value iteration algorithm. Using a pattern matrix that determines the behaviour pattern of the decision process, the decomposition of the state space is carried out effectively, so the proposed algorithm simplifies the structured algorithm given in Leizarowitz's paper (Math Oper Res 28:553–586, 2003). A numerical example is also given to illustrate the algorithm.
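A minimal sketch of the decomposition step only: states are grouped into communicating classes of the MDP (mutual reachability under some choice of actions) via pairwise reachability, the closed classes are identified, and the remaining states are collected into the transient class. The average-optimal computation on each class, the pattern matrix, and the paper's actual procedure are not reproduced; the MDP is hypothetical.

```python
import numpy as np

# Hypothetical MDP with 5 states and 2 actions: P[a, s, s'].
n = 5
P = np.zeros((2, n, n))
P[0] = [[0.5, 0.5, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0],
        [0, 0, 0.3, 0.7, 0], [0, 0, 0, 0.5, 0.5]]
P[1] = [[0, 1, 0, 0, 0], [0.2, 0.8, 0, 0, 0], [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0], [0, 0, 0.6, 0, 0.4]]

adj = (P > 0).any(axis=0)          # arc i -> j iff some action can move i to j

def reachable(i):
    """All states reachable from i under some sequence of actions (DFS)."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v in np.where(adj[u])[0]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

reach = [reachable(i) for i in range(n)]
classes = []                        # communicating classes: mutual reachability
for i in range(n):
    cls = frozenset(j for j in range(n) if j in reach[i] and i in reach[j])
    if cls not in classes:
        classes.append(cls)
# Closed classes (no action leads outside); the remaining states are transient.
closed = [c for c in classes if all(set(np.where(adj[i])[0]) <= c for i in c)]
transient = set(range(n)) - set().union(*closed)

print("communicating classes:", [set(c) for c in classes])
print("closed classes:", [set(c) for c in closed], "transient states:", transient)
```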

16.
Single Sample Path-Based Optimization of Markov Chains
Motivated by the needs of on-line optimization of real-world engineering systems, we studied single-sample-path-based algorithms for Markov decision problems (MDPs). The sample path used in the algorithms can be obtained by observing the operation of a real system. We give a simple example to explain the advantages of the sample-path-based approach over the traditional computation-based approach: matrix inversion is not required; some transition probabilities do not have to be known; it may save storage space; and it gives the flexibility of iterating the actions for only a subset of the state space in each iteration. The effect of estimation errors and the convergence property of the sample-path-based approach are studied. Finally, we propose a fast algorithm, which updates the policy whenever the system reaches a particular set of states, and we prove that the algorithm converges to the true optimal policy with probability one under some conditions. The sample-path-based approach may have important applications to the design and management of engineering systems, such as high-speed communication networks. This work was supported in part by
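A simplified sketch in the spirit of the single-sample-path approach: the potentials (relative values) of the current policy are estimated from one observed trajectory using a regenerative reference state, and the policy is then improved state by state. The MDP, the regenerative estimator, and the improvement step below (which still uses the model's transition probabilities) are illustrative simplifications, not the algorithm of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical MDP: P[a, s, s'] and costs c[s, a]; minimise the average cost.
P = np.array([[[0.7, 0.3, 0.0], [0.1, 0.6, 0.3], [0.4, 0.0, 0.6]],
              [[0.2, 0.4, 0.4], [0.5, 0.3, 0.2], [0.1, 0.5, 0.4]]])
c = np.array([[2.0, 1.5], [1.0, 3.0], [4.0, 0.5]])
n = 3

def improve_from_sample_path(pi, n_steps=100_000, ref=0):
    """One policy-improvement step driven by a single observed sample path."""
    # 1. Simulate the chain under the current stationary policy pi.
    states = np.empty(n_steps, dtype=int)
    x = ref
    for t in range(n_steps):
        states[t] = x
        x = rng.choice(n, p=P[pi[x], x])
    costs = c[states, pi[states]]
    eta = costs.mean()                              # average-cost estimate

    # 2. Estimate potentials g(i): accumulated (cost - eta) from the first visit
    #    to i in a regenerative cycle until the next visit to the reference state.
    g_sum, g_cnt = np.zeros(n), np.zeros(n)
    acc = {}                                        # one open accumulator per state
    for t in range(n_steps):
        s = states[t]
        if s == ref and acc:                        # regeneration: close all cycles
            for k, v in acc.items():
                g_sum[k] += v
                g_cnt[k] += 1
            acc = {}
        if s not in acc:                            # first visit in this cycle
            acc[s] = 0.0
        for k in acc:
            acc[k] += costs[t] - eta
    g = g_sum / np.maximum(g_cnt, 1)

    # 3. Improve: in each state pick the action minimising c(s,a) + P(.|s,a) @ g.
    q = c + np.einsum("asj,j->sa", P, g)
    return np.argmin(q, axis=1), eta

pi = np.zeros(n, dtype=int)                         # start with action 0 everywhere
pi_new, eta = improve_from_sample_path(pi)
print("estimated average cost:", eta, "improved policy:", pi_new)
```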

17.
This paper is concerned with the properties of the value-iteration operator which arises in undiscounted Markov decision problems. We give both necessary and sufficient conditions for this operator to reduce to a contraction operator, in which case it is easy to show that the value-iteration method exhibits a uniform geometric convergence rate. As necessary conditions we obtain a number of important characterizations of the chain and periodicity structures of the problem, and as sufficient conditions we give a general "scrambling-type" recurrence condition, which encompasses a number of important special cases. Next, we show that a data transformation turns every unichained undiscounted Markov renewal program into an equivalent undiscounted Markov decision problem in which the value-iteration operator is contracting, because it satisfies this "scrambling-type" condition. We exploit this contraction property in order to obtain lower and upper bounds as well as variational characterizations for the fixed point of the optimality equation, and a test for eliminating suboptimal actions.
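A short sketch of the two ingredients highlighted above, in a plain finite unichain MDP: an aperiodicity-type data transformation that makes the value-iteration operator a span contraction, and the resulting lower/upper bounds on the optimal gain used as a stopping test. The data are hypothetical, and the Markov-renewal version of the transformation is not reproduced.

```python
import numpy as np

# Hypothetical unichain average-reward MDP: P[a, s, s'], rewards r[s, a].
P = np.array([[[0.0, 1.0], [1.0, 0.0]],        # action 0 (periodic on its own)
              [[0.5, 0.5], [0.2, 0.8]]])
r = np.array([[1.0, 0.4], [0.0, 2.0]])
n_states = r.shape[0]

# Aperiodicity-type data transformation: P_tau = tau*I + (1-tau)*P, rewards kept.
# Every stationary distribution (hence every gain) is unchanged, but each
# transformed transition matrix has a strictly positive diagonal, so value
# iteration contracts in span.
tau = 0.5
P_tau = tau * np.eye(n_states)[None, :, :] + (1 - tau) * P

v = np.zeros(n_states)
for it in range(10_000):
    v_next = (r + np.einsum("asj,j->sa", P_tau, v)).max(axis=1)
    diff = v_next - v
    lower, upper = diff.min(), diff.max()      # bounds on the optimal gain g*
    v = v_next
    if upper - lower < 1e-8:                   # span stopping test
        break

policy = np.argmax(r + np.einsum("asj,j->sa", P_tau, v), axis=1)
print(f"optimal gain in [{lower:.6f}, {upper:.6f}], greedy policy {policy}")
```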

18.
This paper deals with a continuous-time Markov decision process in Borel state and action spaces and with unbounded transition rates. Under history-dependent policies, the controlled process may not be Markov. The main contribution is that for such non-Markov processes we establish the Dynkin formula, which plays an important role in establishing optimality results for continuous-time Markov decision processes. We further illustrate this by showing, for a discounted continuous-time Markov decision process, the existence of a deterministic stationary optimal policy (out of the class of history-dependent policies) and by characterizing the value function through the Bellman equation.
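For reference, the Dynkin formula mentioned above takes, in its familiar Markov-process form (stated informally, for suitable functions f, stopping times τ, and extended generator A; the paper's contribution is a version for the possibly non-Markov process induced by history-dependent policies):

$$\mathbb{E}_x\big[f(X_\tau)\big]-f(x)=\mathbb{E}_x\Big[\int_0^{\tau}(\mathcal{A}f)(X_s)\,ds\Big].$$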

19.
In this paper, we study a modified minimal-repair/replacement problem that is formulated as a Markov decision process. The operating cost is assumed to be a nondecreasing function of the system's age. The maintenance actions considered for a manufacturing system are replacement, minimal repair, and keeping the system operating. It is shown that a control-limit policy, or in particular a (t, T) policy, is optimal over the space of all possible policies under the discounted cost criterion. A computational algorithm for the optimal (t, T) policy is suggested based on the total expected discounted cost.
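A small sketch of the control-limit idea on a discretized age space under the discounted cost criterion: value iteration on a keep/replace MDP with nondecreasing operating costs, after which the optimal policy exhibits a replacement threshold. The cost data and the omission of the minimal-repair option are simplifications of the paper's (t, T) model.

```python
import numpy as np

beta = 0.95                                  # discount factor
N = 30                                       # truncated age space 0..N
c_op = 1.0 + 0.3 * np.arange(N + 1)          # nondecreasing operating cost
c_rep = 20.0                                 # replacement cost

# Discounted MDP on ages: action "keep" operates (age k -> k+1), action
# "replace" pays c_rep and resets the age to 0; operating past age N is forbidden.
V = np.zeros(N + 1)
for _ in range(5_000):
    keep = c_op + beta * np.r_[V[1:], np.inf]
    replace = c_rep + beta * V[0]
    V_new = np.minimum(keep, replace)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

# Recover the optimal policy and its control-limit (threshold) structure.
keep = c_op + beta * np.r_[V[1:], np.inf]
replace = c_rep + beta * V[0]
policy = (replace <= keep).astype(int)       # 1 = replace
T = int(np.argmax(policy))
print("control-limit policy: replace from age T =", T)
print("action per age (0 = keep, 1 = replace):", policy)
```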

20.
This paper deals with the average expected reward criterion for continuous-time Markov decision processes in general state and action spaces. The transition rates of the underlying continuous-time jump Markov processes are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. We give conditions on the system's primitive data under which we prove the existence of the average-reward optimality equation and of an average optimal stationary policy. Also, under our conditions we ensure the existence of ε-average optimal stationary policies. Moreover, we study some properties of average optimal stationary policies. We not only establish another average optimality equation on an average optimal stationary policy, but also present an interesting "martingale characterization" of such a policy. The approach provided in this paper is based on the policy iteration algorithm. It should be noted that our approach is rather different from both the usual "vanishing discount factor approach" and the "optimality inequality approach" widely used in the previous literature.
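A compact sketch of the policy-iteration route for the average criterion, applied to a small continuous-time MDP with bounded transition rates via uniformization (the paper itself allows unbounded rates and general Borel spaces, which this finite sketch does not capture). All rate and reward data are hypothetical.

```python
import numpy as np

# Hypothetical CTMDP: rate matrices Q[a, s, s'] (rows sum to 0) and reward
# rates r[s, a]; criterion: long-run average expected reward per unit time.
Q = np.array([[[-3.0, 2.0, 1.0], [1.0, -1.5, 0.5], [2.0, 0.0, -2.0]],
              [[-1.0, 0.5, 0.5], [3.0, -4.0, 1.0], [0.5, 1.5, -2.0]]])
r = np.array([[5.0, 2.0], [1.0, 4.0], [0.0, 3.0]])
n = r.shape[0]

# Uniformization: P = I + Q/Lam keeps every stationary distribution, so the
# discrete-time average reward equals the continuous-time average reward rate.
Lam = np.max(-Q[:, np.arange(n), np.arange(n)]) + 1.0
P = np.eye(n)[None, :, :] + Q / Lam

pi = np.zeros(n, dtype=int)
for _ in range(100):                          # policy iteration
    P_pi, r_pi = P[pi, np.arange(n)], r[np.arange(n), pi]
    # Evaluation: solve the Poisson equation g*1 + (I - P_pi) h = r_pi with h[0] = 0.
    M = np.column_stack([np.ones(n), (np.eye(n) - P_pi)[:, 1:]])
    x = np.linalg.solve(M, r_pi)
    g, h = x[0], np.r_[0.0, x[1:]]
    # Improvement: maximise reward rate plus expected relative value.
    pi_new = np.argmax(r + np.einsum("asj,j->sa", P, h), axis=1)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

print("optimal average reward rate:", g, "optimal stationary policy:", pi)
```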
