Similar Documents
A total of 20 similar documents were found (search time: 590 ms).
1.
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with convergence as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and give an associated error bound.
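A minimal synchronous sketch of the idea described in this abstract: policy evaluation is carried out inexactly by a few value-iteration sweeps on an optimal stopping problem over the Q-factors rather than by solving a linear system. The 3-state stochastic shortest path instance (state 2 is the cost-free termination state), all transition data, and the sweep counts are hypothetical illustrations, not taken from the paper.

import numpy as np

# Hypothetical 3-state SSP; state 2 is the cost-free termination state.
n_states, n_controls = 3, 2
TERMINAL = 2
P = np.array([  # P[u, i, j] = transition probability from i to j under control u
    [[0.5, 0.3, 0.2], [0.1, 0.5, 0.4], [0.0, 0.0, 1.0]],
    [[0.2, 0.4, 0.4], [0.3, 0.1, 0.6], [0.0, 0.0, 1.0]],
])
g = np.array([[2.0, 1.0], [1.5, 0.5], [0.0, 0.0]])  # g[i, u] = expected stage cost

Q = np.zeros((n_states, n_controls))
policy = np.zeros(n_states, dtype=int)

for outer in range(50):
    # Policy evaluation phase: a finite number of value-iteration sweeps on the
    # optimal stopping problem min(J, Q(., policy)); no linear system is solved.
    for sweep in range(5):
        J = np.minimum(Q.min(axis=1), Q[np.arange(n_states), policy])
        Q = g + np.einsum('uij,j->iu', P, J)
        Q[TERMINAL, :] = 0.0
    # Policy improvement phase: one minimization over all controls.
    policy = Q.argmin(axis=1)

print("Q-factors:\n", Q)
print("greedy policy:", policy)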

2.
We introduce a class of models for multidimensional control problems that we call skip-free Markov decision processes on trees. We describe and analyse an algorithm applicable to Markov decision processes of this type that are skip-free in the negative direction. Starting with the finite average cost case, we show that the algorithm combines the advantages of both value iteration and policy iteration: it is guaranteed to converge to an optimal policy and optimal value function after a finite number of iterations, but the computational effort required for each iteration step is comparable with that for value iteration. We show that the algorithm can also be used to solve discounted cost models and continuous-time models, and that a suitably modified algorithm can be used to solve communicating models.

3.
In this paper, a unified policy iteration approach is presented for the optimal control problem of stochastic systems with discounted average cost and continuous state space. The approach consists of temporal difference learning-based potential function approximation algorithms and performance difference formula-based policy improvement. The approximation algorithms are derived by solving the Poisson equation-based fixed-point equation, and can be viewed as continuous versions of the least squares policy evaluation algorithm and the least squares temporal difference algorithm. Simulations are provided to illustrate the effectiveness of the approach.
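To make the least-squares temporal-difference evaluation idea concrete, here is a minimal LSTD(0) sketch with a linear basis. It uses the standard discounted, finite-sample formulation rather than the paper's continuous-state discounted-average-cost setting; the feature map, dynamics, and stage cost below are made-up stand-ins.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95
d = 4                                          # number of basis functions

def phi(x):
    # Hypothetical feature map on a scalar state.
    return np.array([1.0, x, x**2, np.cos(x)])

# Simulate a trajectory of (state, cost, next state) under some fixed policy.
A = np.zeros((d, d))
b = np.zeros(d)
x = 0.0
for t in range(5000):
    x_next = 0.8 * x + rng.normal(scale=0.3)   # stand-in for the controlled dynamics
    cost = x**2 + 0.1                          # stand-in stage cost
    f, f_next = phi(x), phi(x_next)
    A += np.outer(f, f - gamma * f_next)       # accumulate the LSTD normal equations
    b += f * cost
    x = x_next

w = np.linalg.solve(A, b)                      # value approximation: V(x) ~ phi(x) @ w
print("LSTD weights:", w)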

4.
We study risk-sensitive control of continuous-time Markov chains taking values in a discrete state space, and consider both finite and infinite horizon problems. In the finite horizon problem we characterize the value function via the Hamilton-Jacobi-Bellman equation and obtain an optimal Markov control. We do the same for the infinite horizon discounted cost case. In the infinite horizon average cost case we establish the existence of an optimal stationary control under a certain Lyapunov condition. We also develop a policy iteration algorithm for finding an optimal control.

5.
Policy iteration is a well-studied algorithm for solving stationary Markov decision processes (MDPs). It has also been extended to robust stationary MDPs. For robust nonstationary MDPs, however, an “as is” execution of this algorithm is not possible because it would call for an infinite amount of computation in each iteration. We therefore present a policy iteration algorithm for robust nonstationary MDPs, which performs finitely implementable approximate variants of policy evaluation and policy improvement in each iteration. We prove that the sequence of cost-to-go functions produced by this algorithm monotonically converges pointwise to the optimal cost-to-go function; the policies generated converge subsequentially to an optimal policy.
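For orientation, the following is a rough sketch of robust policy iteration on a *stationary* finite MDP whose transition kernel is only known to lie in a finite scenario set (the worst case is taken adversarially). The paper's contribution is the much harder nonstationary case with finitely implementable approximate steps, which this toy code does not attempt; all numerical data are illustrative assumptions.

import numpy as np

gamma, nS, nA = 0.9, 3, 2
rng = np.random.default_rng(1)
# Two transition scenarios: scenarios[k][a, s, :] is a distribution over next states.
scenarios = []
for k in range(2):
    P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
    scenarios.append(P)
cost = rng.random((nS, nA))

def robust_eval(policy, V, sweeps=50):
    # Approximate robust policy evaluation by value-iteration sweeps (worst case over scenarios).
    for _ in range(sweeps):
        V = np.array([max(cost[s, policy[s]] + gamma * P[policy[s], s] @ V for P in scenarios)
                      for s in range(nS)])
    return V

policy, V = np.zeros(nS, dtype=int), np.zeros(nS)
for it in range(20):
    V = robust_eval(policy, V)
    # Robust policy improvement: greedy with respect to the worst-case Q-values.
    Q = np.array([[max(cost[s, a] + gamma * P[a, s] @ V for P in scenarios)
                   for a in range(nA)] for s in range(nS)])
    new_policy = Q.argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("robust policy:", policy, "worst-case cost-to-go:", V)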

6.
By the dynamic programming principle, the value function of an optimally controlled stochastic switching process can be shown to satisfy a boundary value problem for a fully nonlinear second-order elliptic differential equation of Hamilton-Jacobi-Bellman (HJB) type. For the numerical solution of that HJB equation we present a multi-grid algorithm whose main features are the use of nonlinear Gauss-Seidel iteration in the smoothing process and an adaptive local choice of prolongations and restrictions in the coarse-to-fine and fine-to-coarse transfers. Local convergence is proved by combining nonlinear multi-grid convergence theory and elementary subdifferential calculus. The efficiency of the algorithm is demonstrated for optimal advertising in stochastic dynamic sales response models of Vidale-Wolfe type.
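A tiny sketch of the nonlinear Gauss-Seidel smoothing step mentioned above, applied to a Bellman-type fixed-point equation: states are updated in place, so each update already uses the newest values of its neighbours (unlike a Jacobi/value-iteration sweep). The full multi-grid machinery (prolongation, restriction, coarse grids) is omitted, and the discounted toy problem below is an illustrative stand-in for a discretized HJB equation, not the paper's model.

import numpy as np

rng = np.random.default_rng(6)
gamma, nS, nA = 0.9, 8, 2
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((nS, nA))

V = np.zeros(nS)
for sweep in range(100):
    for s in range(nS):   # in-place update = nonlinear Gauss-Seidel smoothing
        V[s] = min(c[s, a] + gamma * P[a, s] @ V for a in range(nA))

print("smoothed value function:", np.round(V, 4))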

7.
8.
In just-in-time (JIT) production systems, there is both input stock in the form of parts and output stock in the form of product at each stage. These activities are controlled by production-ordering and withdrawal kanbans. This paper discusses a discrete-time optimal control problem in a multistage JIT-based production and distribution system with stochastic demand and capacity, developed to minimize the expected total cost per unit of time. The problem can be formulated as an undiscounted Markov decision process (UMDP); however, the curse of dimensionality makes it very difficult to find an exact solution. The author proposes a new neuro-dynamic programming (NDP) algorithm, the simulation-based modified policy iteration method (SBMPIM), to solve the optimal control problem. The existing NDP algorithms and the SBMPIM are numerically compared with a traditional UMDP algorithm for a single-stage JIT production system. It is shown that all NDP algorithms except the SBMPIM fail to converge to an optimal control. Additionally, a new algorithm for finding the optimal parameters of pull systems is proposed. Numerical comparisons between near-optimal controls computed using the SBMPIM and optimized pull systems are conducted for three-stage JIT-based production and distribution systems. UMDPs with 42 million states are solved using the SBMPIM. The pull systems discussed are the kanban, base stock, CONWIP, hybrid and extended kanban.
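The method above builds on modified policy iteration, in which policy evaluation is truncated to a fixed number of value-iteration sweeps rather than solved exactly. The sketch below shows only that generic building block on a small discounted MDP; it is not the SBMPIM itself (no simulation, no kanban model), and the problem data are hypothetical.

import numpy as np

rng = np.random.default_rng(2)
gamma, nS, nA, m = 0.95, 6, 3, 10
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((nS, nA))

V = np.zeros(nS)
for it in range(200):
    # Policy improvement: greedy policy for the current value estimate.
    Q = c + gamma * np.einsum('aij,j->ia', P, V)
    policy = Q.argmin(axis=1)
    # Truncated policy evaluation: m sweeps of the fixed-policy Bellman operator.
    for _ in range(m):
        V = np.array([c[s, policy[s]] + gamma * P[policy[s], s] @ V for s in range(nS)])

print("policy:", policy)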

9.
We consider a production planning problem for a jobshop with unreliable machines producing a number of products. There are upper and lower bounds on intermediate parts and an upper bound on finished parts. The machine capacities are modelled as finite state Markov chains. The objective is to choose the rate of production so as to minimize the total discounted cost of inventory and production. Finding an optimal control policy for this problem is difficult. Instead, we derive an asymptotic approximation by letting the rates of change of the machine states approach infinity. The asymptotic analysis leads to a limiting problem in which the stochastic machine capacities are replaced by their equilibrium mean capacities. The value function for the original problem is shown to converge to the value function of the limiting problem. The convergence rate of the value function is established, together with an error estimate for the constructed asymptotically optimal production policies.

10.
This paper investigates a dynamic event-triggered optimal control problem of discrete-time (DT) nonlinear Markov jump systems (MJSs) by exploring policy iteration (PI) adaptive dynamic programming (ADP) algorithms. The performance index function (PIF) defined in each subsystem is updated by utilizing an online PI algorithm, and the corresponding control policy is derived by solving the optimal PIF. Then, we adopt neural network (NN) techniques, including an actor network and a critic network, to estimate the iterative PIF and control policy. Moreover, the designed dynamic event-triggered mechanism (DETM) is employed to avoid wasting additional resources when the estimated iterative control policy is updated. Finally, based on the Lyapunov difference method, it is proved that the system stability and the convergence of all signals can be guaranteed under the developed control scheme. A simulation example for DT nonlinear MJSs with two system modes is presented to demonstrate the feasibility of the control design scheme.

11.
This paper studies the risk minimization problem in semi-Markov decision processes with denumerable states. The criterion to be optimized is the risk probability (or risk function) that a first passage time to some target set does not exceed a threshold value. We first characterize such risk functions and the corresponding optimal value function, and prove that the optimal value function satisfies the optimality equation by using a successive approximation technique. Then, we present some properties of optimal policies, and further give conditions for the existence of optimal policies. In addition, a value iteration algorithm and a policy improvement method for obtaining respectively the optimal value function and optimal policies are developed. Finally, two examples are given to illustrate the value iteration procedure and the essential characterization of the risk function.
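As a rough, simplified discrete-time analogue of this first-passage criterion, the value iteration below computes the maximal probability of reaching a target set within N steps in a finite MDP. The paper works with semi-Markov processes, denumerable states, and sojourn-time thresholds; this toy finite, unit-sojourn-time version and its data are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(7)
nS, nA, N = 6, 2, 15
target = {5}
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)

# V[s] = maximal probability of hitting the target within the remaining horizon from state s.
V = np.array([1.0 if s in target else 0.0 for s in range(nS)])
for step in range(N):
    V = np.array([1.0 if s in target else
                  max(P[a, s] @ V for a in range(nA)) for s in range(nS)])

print("optimal reach probabilities within", N, "steps:", np.round(V, 4))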

12.
This note provides a simple example demonstrating that, if exact computations are allowed, the number of iterations required for the value iteration algorithm to find an optimal policy for discounted dynamic programming problems may grow arbitrarily quickly with the size of the problem. In particular, the number of iterations can be exponential in the number of actions. Thus, unlike policy iteration, the value iteration algorithm is not strongly polynomial for discounted dynamic programming.
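The experiment below is not the note's construction, only an illustrative comparison on a random discounted MDP: it counts how many value-iteration sweeps pass before the greedy policy matches the policy-iteration optimum, versus the number of policy iterations. The note's point is that on carefully built examples the former count can grow exponentially with problem size.

import numpy as np

rng = np.random.default_rng(3)
gamma, nS, nA = 0.99, 20, 5
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((nS, nA))

def greedy(V):
    return (c + gamma * np.einsum('aij,j->ia', P, V)).argmin(axis=1)

# Policy iteration (exact evaluation via a linear solve).
pol = np.zeros(nS, dtype=int); pi_iters = 0
while True:
    pi_iters += 1
    Ppol = P[pol, np.arange(nS)]                  # row s is P[pol[s], s, :]
    V = np.linalg.solve(np.eye(nS) - gamma * Ppol, c[np.arange(nS), pol])
    new = greedy(V)
    if np.array_equal(new, pol):
        break
    pol = new

# Value iteration: count sweeps until its greedy policy matches the optimal one.
V = np.zeros(nS); vi_iters = 0
while not np.array_equal(greedy(V), pol):
    V = (c + gamma * np.einsum('aij,j->ia', P, V)).min(axis=1)
    vi_iters += 1

print(f"policy iteration: {pi_iters} iterations, value iteration: {vi_iters} sweeps")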

13.
Motivated by multi-user optimization problems and non-cooperative Nash games in uncertain regimes, we consider stochastic Cartesian variational inequality problems where the set is given as the Cartesian product of a collection of component sets. First, we consider the case where the number of component sets is large and develop a randomized block stochastic mirror-prox algorithm, where at each iteration only a randomly selected block coordinate of the solution vector is updated through two consecutive projection steps. We show that when the mapping is strictly pseudo-monotone, the algorithm generates a sequence of iterates that converges to the solution of the problem almost surely. When the maps are strongly pseudo-monotone, we prove that the mean-squared error diminishes at the optimal rate. Second, we consider large-scale stochastic optimization problems with convex objectives and develop a new averaging scheme for the randomized block stochastic mirror-prox algorithm. We show that by using a different set of weights than those employed in the classical stochastic mirror-prox methods, the objective values of the averaged sequence converge to the optimal value in the mean sense at an optimal rate. Third, we consider stochastic Cartesian variational inequality problems and develop a stochastic mirror-prox algorithm that employs the new weighted averaging scheme. We show that the expected value of a suitably defined gap function converges to zero at an optimal rate.
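A Euclidean-prox sketch of the randomized block step described above: at each iteration a single randomly chosen block is updated through two consecutive projected (extragradient-type) steps using noisy evaluations of the map. The strongly monotone linear map, box-shaped component sets, noise model, and step sizes are all illustrative assumptions, not the paper's setting or rates.

import numpy as np

rng = np.random.default_rng(4)
n_blocks, block_dim = 5, 3
n = n_blocks * block_dim
A = rng.random((n, n)); A = A @ A.T + n * np.eye(n)   # strongly monotone linear map
b = rng.random(n)

def F_noisy(x):
    return A @ x + b + rng.normal(scale=0.1, size=n)  # stochastic oracle for F(x) = Ax + b

def proj_box(v):                                      # each component set is the box [-1, 1]
    return np.clip(v, -1.0, 1.0)

x = np.zeros(n)
for k in range(1, 20001):
    step = 1.0 / (n * np.sqrt(k))
    i = rng.integers(n_blocks)                        # randomly selected block coordinate
    blk = slice(i * block_dim, (i + 1) * block_dim)
    # First projection step (extrapolation point), then a second step from the old iterate.
    y = x.copy()
    y[blk] = proj_box(x[blk] - step * F_noisy(x)[blk])
    x[blk] = proj_box(x[blk] - step * F_noisy(y)[blk])

print("approximate solution of the VI:", np.round(x, 3))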

14.
This paper studies the policy iteration algorithm (PIA) for average cost Markov control processes (MCPs) on Borel spaces. Two classes of MCPs are considered. One of them allows some restricted-growth unbounded cost functions and compact control constraint sets; the other one requires strictly unbounded costs and the control constraint sets may be non-compact. For each of these classes, the PIA yields, under suitable assumptions, the optimal (minimum) cost, an optimal stationary control policy, and a solution to the average cost optimality equation.
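For reference, here is a finite-state, unichain sketch of the average-cost PIA: evaluation solves for the gain g and a bias h (normalized by h[0] = 0), and improvement is greedy with respect to c(s, a) + P(.|s, a) h. The paper's setting is Borel state spaces with unbounded costs; this toy finite version and its data are only illustrative.

import numpy as np

rng = np.random.default_rng(5)
nS, nA = 5, 3
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((nS, nA))

policy = np.zeros(nS, dtype=int)
for it in range(50):
    Ppol = P[policy, np.arange(nS)]
    cpol = c[np.arange(nS), policy]
    # Evaluation equations: g + h(s) = c(s, policy(s)) + sum_j P(s, j) h(j), with h[0] = 0.
    M = np.zeros((nS + 1, nS + 1))
    M[:nS, 0] = 1.0                  # coefficient of the gain g
    M[:nS, 1:] = np.eye(nS) - Ppol
    M[nS, 1] = 1.0                   # normalization h[0] = 0
    sol = np.linalg.solve(M, np.append(cpol, 0.0))
    g, h = sol[0], sol[1:]
    # Improvement: greedy with respect to the bias.
    new_policy = (c + np.einsum('aij,j->ia', P, h)).argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print(f"optimal average cost (gain): {g:.4f}, policy: {policy}")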

15.
16.
We consider a deterministic simple epidemic process in which the susceptibles are exposed to n+1 diseases. It is assumed that one disease is relatively harmless while the others cause serious symptoms. Policies for introducing infection by the harmless disease are considered and, under a suitable cost structure, the optimal policy that minimises the future cost for every initial state is found. For the corresponding stochastic model, the optimal policy is found by implementing a suitable dynamic programming algorithm, and is compared numerically with the optimal policy for the deterministic model.

17.
The paper proves the convergence of the (Approximate) Iterated Successive Approximations Algorithm for solving infinite-horizon sequential decision processes satisfying the monotone contraction assumption. At every stage of this algorithm, the value function at hand is used as a terminal reward to determine an (approximately) optimal policy for the one-period problem. This policy is then iterated for a (finite or infinite) number of times, and the resulting return function is used as the starting value function for the next stage of the scheme. This method generalizes the standard successive approximations, policy iteration and Denardo's generalization of the latter.

18.
In this paper, we deal with a multi-item, stochastic, periodic review inventory system with a general cost structure which permits partial or complete backlogging of unfilled demand. Since neither the (s, S) policy nor the mixed reorder policy is optimal, we derive several properties of an optimal ordering policy and propose a new algorithm for computing it. This algorithm is based on the policy iteration method (PIM), but substantially reduces computation times in the policy evaluation and improvement routines of the PIM.

19.
In this paper, we propose a new optimization technique by modifying a chaos optimization algorithm (COA) based on fractal theory. We first implement the weighted gradient direction-based chaos optimization, in which the chaotic property is used to determine the initial choice of the optimization parameters, both in the starting step and in the mutations applied when convergence to a local minimum occurs. The algorithm is then improved by introducing a method to determine the optimal step size. This method is based on the fact that the sensitive dependence on the initial condition of a root-finding technique (such as the Newton-Raphson search technique) has a fractal nature. Of all the roots (step sizes) found by the implemented technique, the one that most reduces the cost function is employed in each iteration. Numerical simulation results are presented to evaluate the performance of the proposed algorithm.
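A generic chaos-optimization sketch in the spirit of the abstract: logistic-map iterates supply exploration points, a gradient step refines the best candidate, and among several candidate step sizes the one giving the lowest cost is kept, loosely echoing the "pick the best root/step size each iteration" idea. The cost function, bounds, and constants are illustrative; this is not the paper's fractal-based algorithm.

import numpy as np

def cost(x):
    return np.sum(x**2 - 10 * np.cos(2 * np.pi * x) + 10)   # Rastrigin test function

def grad(x, eps=1e-6):
    return np.array([(cost(x + eps * e) - cost(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])

lo, hi, dim = -5.0, 5.0, 2
z = np.array([0.37, 0.61])                   # chaotic variables in (0, 1)
best_x, best_f = None, np.inf

for k in range(300):
    z = 4.0 * z * (1.0 - z)                  # logistic map: chaotic exploration
    x = lo + (hi - lo) * z                   # map chaotic variables into the search box
    if cost(x) < best_f:
        best_x, best_f = x, cost(x)
    # Local refinement: try several step sizes along the negative gradient, keep the best.
    g = grad(best_x)
    for step in (1e-3, 1e-2, 1e-1):
        cand = np.clip(best_x - step * g, lo, hi)
        if cost(cand) < best_f:
            best_x, best_f = cand, cost(cand)

print("best point:", np.round(best_x, 4), "cost:", round(best_f, 6))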

20.
Using the decomposition of the solution of an SDE, we consider the stochastic optimal control problem with anticipative controls as a family of deterministic control problems parametrized by the paths of the driving Wiener process and of a newly introduced Lagrange multiplier stochastic process (nonanticipativity equality constraint). It is shown that the value function of these problems is the unique global solution of a robust equation (random partial differential equation) associated with a linear backward Hamilton-Jacobi-Bellman stochastic partial differential equation (HJB SPDE). This appears as the limiting SPDE for a sequence of random HJB PDEs when a linear interpolation approximation of the Wiener process is used. Our approach extends Wong-Zakai type results [20] from SDEs to the stochastic dynamic programming equation by showing how the latter arises as the average of the limit of a sequence of deterministic dynamic programming equations. The stochastic characteristics method of Kunita [13] is used to represent the value function. By choosing the Lagrange multiplier equal to its nonanticipative constraint value, the usual stochastic (nonanticipative) optimal control and optimal cost are recovered. This suggests a method for solving anticipative control problems by almost sure deterministic optimal control. We obtain a PDE for the “cost of perfect information”, the difference between the cost function of the nonanticipative control problem and the cost of the anticipative problem, which satisfies a nonlinear backward HJB SPDE. Poisson bracket conditions are found ensuring that this has a global solution. The cost of perfect information is shown to be zero when a Lagrangian submanifold is invariant for the stochastic characteristics. The LQG problem and a nonlinear anticipative control problem are considered as examples in this framework.
