1.

Multivariate Bayesian process control for a finite production run

Viliam Makis 《European Journal of Operational Research》2009

Most industrial products and processes are characterized by several, typically correlated, measurable variables, which jointly describe the product or process quality. Various control charts such as Hotelling’s T², EWMA and CUSUM charts have been developed for multivariate quality control, where the values of the chart parameters, namely the sample size, sampling interval and the control limits, are determined to satisfy given economic and/or statistical requirements. It is well known that this traditional non-Bayesian approach to control chart design is not optimal, but very few results regarding the form of the optimal Bayesian control policy have appeared in the literature, all limited to univariate chart design. In this paper, we consider a multivariate Bayesian process mean control problem for a finite production run under the assumption that the observations are values of independent, normally distributed vectors of random variables. The problem is formulated in the POMDP (partially observable Markov decision process) framework and the objective is to determine a control policy minimizing the total expected cost. It is proved that under standard operating and cost assumptions the control limit policy is optimal. Cost comparisons with the benchmark chi-squared chart and the MEWMA chart show that the Bayesian chart is highly cost effective; the savings are larger for smaller values of the critical Mahalanobis distance between the in-control and out-of-control process mean.
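The benchmark chi-squared chart named in this abstract can be illustrated with a minimal sketch. It assumes a known in-control mean and covariance matrix and, for brevity, only two quality variables. The function names and the default control limit of 5.99 (roughly the 95% quantile of a chi-squared distribution with 2 degrees of freedom) are illustrative assumptions, not values taken from the paper.

```python
def mahalanobis_sq_2d(x, mean, cov):
    """Squared Mahalanobis distance of a 2-dimensional point from `mean`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]],
    # so the quadratic form expands to the expression below.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi_squared_chart_signals(sample_mean, n, mean0, cov, limit=5.99):
    """Return True when n * D^2 exceeds the control limit.

    `limit` is an assumed value (about the 95% chi-squared quantile, 2 d.o.f.).
    """
    return n * mahalanobis_sq_2d(sample_mean, mean0, cov) > limit


identity = ((1.0, 0.0), (0.0, 1.0))
print(chi_squared_chart_signals((1.0, 1.0), 4, (0.0, 0.0), identity))
```

A smaller Mahalanobis distance between the in-control and out-of-control means makes a shift harder for such a fixed-limit chart to detect, which is the regime in which the abstract reports the largest savings for the Bayesian chart.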

2.

Heuristic anytime approaches to stochastic decision processes

Joaquín L. Fernández Rafael Sanz Reid G. Simmons Amador R. Diéguez 《Journal of Heuristics》2006,12(3):181-209

This paper proposes a set of methods for solving stochastic decision problems modeled as partially observable Markov decision processes (POMDPs). This approach (Real Time Heuristic Decision System, RT-HDS) is based on the use of prediction methods combined with several existing heuristic decision algorithms. The prediction process is one of tree creation. The value function for the last step uses some of the classic heuristic decision methods. To illustrate how this approach works, comparative results of different algorithms with a variety of simple and complex benchmark problems are reported. The algorithm has also been tested in a mobile robot supervision architecture.

3.

Reinforcement learning versus heuristics for order acceptance on a single resource

M. Mainegra Hing A. van Harten P. C. Schuur 《Journal of Heuristics》2007,13(2):167-187

Order Acceptance (OA) is one of the main functions in business control. Accepting an order when capacity is available could prevent the system from accepting more profitable orders in the future, with opportunity losses as a consequence. Uncertain information is also an important issue here. We use Markov decision models and learning methods from Artificial Intelligence to find decision policies under uncertainty. Reinforcement Learning (RL) is quite a new approach in OA. It is shown here that RL works well compared with heuristics. It is demonstrated that employing an RL-trained agent is a robust, flexible approach that in addition can be used to support the detection of good heuristics.

4.

Algebraic results and bottom-up algorithm for policies generalization in reinforcement learning using concept lattices

Marc Ricordeau Michel Liquière 《Nonlinear Analysis: Hybrid Systems》2008,2(2):684-694

The generalization of policies in reinforcement learning is a main issue, both from the theoretical model point of view and for their applicability. However, generalizing from a set of examples or searching for regularities is a problem which has already been intensively studied in machine learning. Thus, existing domains such as Inductive Logic Programming have already been linked with reinforcement learning. Our work uses techniques in which generalizations are constrained by a language bias, in order to regroup similar states. Such generalizations are principally based on the properties of concept lattices. To guide the possible groupings of similar states of the environment, we propose a general algebraic framework, considering the generalization of policies through a partition of the set of states and using a language bias as a priori knowledge. We give a practical application as an example of our theoretical approach by proposing and experimenting with a bottom-up algorithm.

5.

On the evaluation of strategies for branching bandit processes

K. D. Glazebrook R. J. Boys N. A. Fay 《Annals of Operations Research》1991,30(1):299-319

Glazebrook [1] has given an account of improved procedures for strategy evaluation for resource allocation in a stochastic environment. These methods are extended in the paper in such a way that they can be applied to problems which, for example, have precedence constraints and/or an arrivals process of new jobs. Theoretical results, backed up by numerical studies, show that quasi-myopic heuristics often perform well.

6.

On an extremal property of Markov chains and sufficiency of Markov strategies in Markov decision processes with the Dubins-Savage criterion

I. M. Sonin 《Annals of Operations Research》1991,29(1):417-426

An inequality regarding the minimum of P(lim inf(X_n ∈ D_n)) is proved for a class of random sequences. This result is related to the problem of sufficiency of Markov strategies for Markov decision processes with the Dubins-Savage criterion, the asymptotic behaviour of nonhomogeneous Markov chains, and some other problems.

7.

Learning of winning strategies for terminal games with linear-size memory

Thomas Böhme Frank Göring Zsolt Tuza Herwig Unger 《International Journal of Game Theory》2009,38(2):155-168

We prove that if one or more players in a locally finite positional game have winning strategies, then they can find them by themselves, not losing more than a bounded number of plays and not using more than a linear-size memory, independently of the strategies applied by the other players. We design two algorithms for learning how to win. One of them can also be modified to determine a strategy that achieves a draw, provided that no winning strategy exists for the player in question but with properly chosen moves a draw can be ensured from the starting position. If a drawing or winning strategy exists, then it is learnt after no more than a linear number of plays lost (linear in the number of edges of the game graph). Z. Tuza’s research has been supported in part by the grant OTKA T-049613.

8.

Monotone optimal replacement policies for a Markovian deteriorating system in a controllable environment

Murat Kurt Jeffrey P. Kharoufeh 《Operations Research Letters》2010,38(4):273-279

We consider the optimal replacement of a periodically inspected system under Markov deterioration that operates in a controlled environment. Provided are sufficient conditions that characterize an optimal control-limit replacement policy with respect to the system’s condition and its environment. The structure of the optimal policy is illustrated by a numerical example.

9.

Structured replacement policies for a Markov-modulated shock model

Murat Kurt Lisa M. Maillart 《Operations Research Letters》2009,37(4):280-284

We establish the optimality of structured replacement policies for a periodically inspected system that fails silently whenever the cumulative number of shocks, or the magnitude of a single shock it has received, exceeds a corresponding threshold. Shocks arrive according to a Markov-modulated Poisson process which represents the (controllable or uncontrollable) environment.

10.

Dynamic repositioning strategy in a bike-sharing system; how to prioritize and how to rebalance a bike station

Benjamin Legros 《European Journal of Operational Research》2019,272(2):740-753

Bike-sharing systems are becoming increasingly popular in large cities. The natural imbalance and the stochasticity of bike arrivals and departures lead operators to develop redistribution strategies in order to ensure a sufficiently high quality of service for users. Using a Markov decision process approach, we develop an implementable decision-support tool which may help the operator to decide at any point in time (i) which station should be prioritized, and (ii) how many bikes should be added or removed at each station. Our objective is to minimize the rate of arrival of unsatisfied users who find their station empty or full. The existence of an optimal inventory level at each station is proven. It may vary over time but does not depend on the capacity of the truck which operates the repositioning. Next, we compute the relative value function of the system, together with the average cost and the optimal state. These results are used to derive a policy for station prioritization using a one-step policy improvement method. We evaluate our policy in comparison with the optimal one and with other intuitive ones in an extended version of our model. From our numerical experiments, we show that only a little intervention by the operator can significantly enhance the quality of service, and that the rule of thumb for bike repositioning is to prioritize the closer, the more active, the closer to being full or empty, and the more imbalanced stations if no reversal of the imbalance is anticipated.

11.

Dynamic inventory strategies for profit maximization in a service facility with stochastic service, demand and lead time

Oded Berman Eungab Kim 《Mathematical Methods of Operations Research》2004,60(3):497-521

12.

Equivalence classes for optimizing risk models in Markov decision processes

Yoshio Ohtsubo Kenji Toyonaga 《Mathematical Methods of Operations Research》2004,60(2):239-250

13.

Modified policy iteration algorithms are not strongly polynomial for discounted dynamic programming

《Operations Research Letters》2014,42(6-7):429-431

This note shows that the number of arithmetic operations required by any member of a broad class of optimistic policy iteration algorithms to solve a deterministic discounted dynamic programming problem with three states and four actions may grow arbitrarily. Therefore any such algorithm is not strongly polynomial. In particular, the modified policy iteration and λ-policy iteration algorithms are not strongly polynomial.
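For orientation, modified policy iteration alternates a greedy improvement step with only a fixed number m of evaluation sweeps, rather than evaluating each policy exactly. The following is a minimal sketch for a deterministic discounted problem, assuming rewards are maximized. The two-state instance, the function name, and the parameters `m` and `sweeps` are hypothetical illustrations and are not the three-state construction used in the note.

```python
def modified_policy_iteration(next_state, reward, gamma, m, sweeps):
    """Sketch of modified policy iteration for a deterministic discounted MDP.

    next_state[s][a] and reward[s][a] give the successor state and the
    one-step reward of taking action a in state s (assumed inputs).
    """
    n_states = len(next_state)
    v = [0.0] * n_states
    policy = [0] * n_states
    for _ in range(sweeps):
        # Policy improvement: act greedily with respect to the current values.
        for s in range(n_states):
            policy[s] = max(
                range(len(next_state[s])),
                key=lambda a: reward[s][a] + gamma * v[next_state[s][a]],
            )
        # Partial policy evaluation: apply the policy's operator m times.
        for _ in range(m):
            v = [
                reward[s][policy[s]] + gamma * v[next_state[s][policy[s]]]
                for s in range(n_states)
            ]
    return policy, v


# Hypothetical instance: state 0 can stay (reward 0) or move to state 1
# (reward 1); state 1 can only stay (reward 2).
next_state = [[0, 1], [1]]
reward = [[0.0, 1.0], [2.0]]
policy, v = modified_policy_iteration(next_state, reward, gamma=0.5, m=3, sweeps=50)
print(policy, v)
```

With m = 1 the scheme reduces to value iteration, and letting m grow without bound recovers ordinary policy iteration. The note's negative result concerns how the operation count of such schemes can grow, not whether they converge.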

14.

Another set of conditions for Markov decision processes with average sample-path costs

Quanxin Zhu Xianping Guo 《Journal of Mathematical Analysis and Applications》2006,322(2):1199-1214

This paper deals with discrete-time Markov decision processes with average sample-path costs (ASPC) in Borel spaces. The costs may have neither upper nor lower bounds. We propose new conditions for the existence of ε-ASPC-optimal (deterministic) stationary policies in the class of all randomized history-dependent policies. Our conditions are weaker than those in the previous literature. Moreover, some sufficient conditions for the existence of ASPC-optimal stationary policies are imposed on the primitive data of the model. In particular, the stochastic monotonicity condition is used in this paper for the first time to study the ASPC criterion. Also, the approach provided here is slightly different from the “optimality equation approach” widely used in the previous literature. On the other hand, under mild assumptions we show that average expected cost optimality and ASPC-optimality are equivalent. Finally, we use a controlled queueing system to illustrate our results.

15.

Optimal policy for minimizing risk models in Markov decision processes

Y. Ohtsubo K. Toyonaga 《Journal of Mathematical Analysis and Applications》2002,271(1):66-81

We consider the minimizing risk problems in discounted Markov decision processes with countable state space and bounded general rewards. We characterize optimal values for finite and infinite horizon cases and give two sufficient conditions for the existence of an optimal policy in an infinite horizon case. These conditions are closely connected with Lemma 3 in White (1993), which is not correct, as Wu and Lin (1999) point out. We obtain a condition for the lemma to be true, under which we show that there is an optimal policy. Under another condition we show that an optimal value is a unique solution to some optimality equation and there is an optimal policy on a transient set.

16.

Sample-path optimality and variance-maximization for Markov decision processes

Q. X. Zhu 《Mathematical Methods of Operations Research》2007,65(3):519-538

This paper studies both the average sample-path reward (ASPR) criterion and the limiting average variance criterion for denumerable discrete-time Markov decision processes. The rewards may have neither upper nor lower bounds. We give sufficient conditions on the system’s primitive data under which we prove the existence of ASPR-optimal stationary policies and variance-optimal policies. Our conditions are weaker than those in the previous literature. Moreover, our results are illustrated by a controlled queueing system. Research partially supported by the Natural Science Foundation of Guangdong Province (Grant No: 06025063) and the Natural Science Foundation of China (Grant No: 10626021).

17.

Approximate dynamic programming for capacity allocation in the service industry

Hans-Jörg Schütz Rainer Kolisch 《European Journal of Operational Research》2012,218(1):239-250

We consider a problem where different classes of customers can book different types of service in advance and the service company has to respond immediately to the booking request, confirming or rejecting it. The objective of the service company is to maximize profit, made up of class-type specific revenues, refunds for cancellations or no-shows, and the cost of overtime. For the calculation of the latter, information on the underlying appointment schedule is required. In contrast to most models in the literature, we assume that the service time of clients is stochastic and that clients might be unpunctual. Throughout the paper we relate the problem to capacity allocation in radiology services. The problem is modeled as a continuous-time Markov decision process and solved using simulation-based approximate dynamic programming (ADP) combined with a discrete event simulation of the service period. We employ an adapted heuristic ADP algorithm from the literature and investigate the benefits of applying ADP to this type of problem. First, we study a simplified problem with deterministic service times and punctual arrival of clients and compare the solution from the ADP algorithm to the optimal solution. We find that the heuristic ADP algorithm performs very well in terms of objective function value, solution time, and memory requirements. Second, we study the problem with stochastic service times and unpunctuality. It is then shown that the resulting policy constitutes a large improvement over an “optimal” policy that is deduced using restrictive, simplifying assumptions.

18.

Average optimality for continuous-time Markov decision processes with a policy iteration approach

Quanxin Zhu 《Journal of Mathematical Analysis and Applications》2008,339(1):691-704

This paper deals with the average expected reward criterion for continuous-time Markov decision processes in general state and action spaces. The transition rates of underlying continuous-time jump Markov processes are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. We give conditions on the system's primitive data under which we prove the existence of the average reward optimality equation and an average optimal stationary policy. Also, under our conditions we ensure the existence of ε-average optimal stationary policies. Moreover, we study some properties of average optimal stationary policies. We not only establish another average optimality equation on an average optimal stationary policy, but also present an interesting “martingale characterization” of such a policy. The approach provided in this paper is based on the policy iteration algorithm. It should be noted that our approach is rather different from both the usual “vanishing discounting factor approach” and the “optimality inequality approach” widely used in the previous literature.

19.

Average optimality inequality for continuous-time Markov decision processes in Polish spaces

Quanxin Zhu 《Mathematical Methods of Operations Research》2007,66(2):299-313

In this paper, we study the average optimality for continuous-time controlled jump Markov processes in general state and action spaces. The criterion to be minimized is the average expected cost. Both the transition rates and the cost rates are allowed to be unbounded. We propose another set of conditions under which we first establish one average optimality inequality by using the well-known “vanishing discounting factor approach”. Then, when the cost (or reward) rates are nonnegative (or nonpositive), from the average optimality inequality we prove the existence of an average optimal stationary policy among all randomized history-dependent policies by using the Dynkin formula and the Tauberian theorem. Finally, when the cost (or reward) rates have neither upper nor lower bounds, we also prove the existence of an average optimal policy among all (deterministic) stationary policies by constructing a “new” cost (or reward) rate. Research partially supported by the Natural Science Foundation of China (Grant No: 10626021) and the Natural Science Foundation of Guangdong Province (Grant No: 06300957).

20.

Learning dynamic prices in electronic retail markets with customer segmentation

C. V. L. Raju Y. Narahari K. Ravikumar 《Annals of Operations Research》2006,143(1):59-75

In this paper, we use reinforcement learning (RL) techniques to determine dynamic prices in an electronic monopolistic retail market. The market that we consider consists of two natural segments of customers, captives and shoppers. Captives are mature, loyal buyers, whereas the shoppers are more price sensitive and are attracted by sales promotions and volume discounts. The seller is the learning agent in the system and uses RL to learn from the environment. Under (reasonable) assumptions about the arrival process of customers, inventory replenishment policy, and replenishment lead time distribution, the system becomes a Markov decision process, thus enabling the use of a wide spectrum of learning algorithms. In this paper, we use the Q-learning algorithm for RL to arrive at optimal dynamic prices that optimize the seller’s performance metric (either long-term discounted profit or long-run average profit per unit time). Our model and methodology can also be used to compute the optimal reorder quantity and optimal reorder point for the inventory policy followed by the seller and to compute the optimal volume discounts to be offered to the shoppers.
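The Q-learning algorithm referred to in this abstract rests on a single tabular update rule. The sketch below shows that rule for the discounted criterion only. It is a generic textbook form, not the authors' implementation: the interpretation of states as inventory levels and actions as candidate prices, the function name, and the step size and discount values are all assumptions made for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new value.

    q[s][a] is the current estimate of the discounted value of taking
    action a (e.g. posting a price) in state s (e.g. an inventory level).
    """
    best_next = max(q[next_state])  # value of the greedy action in the next state
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


# Hypothetical table with two states and two actions, all estimates at zero.
q = [[0.0, 0.0], [0.0, 0.0]]
print(q_learning_update(q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=0.9))
```

The long-run average profit criterion mentioned in the abstract requires a different update, which this sketch does not cover.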