Similar Literature: 20 matching articles found
1.
We consider multi-armed bandit problems with switching cost and multiple plays, define …

2.
A symmetric Poissonian two-armed bandit becomes, in terms of a posteriori probabilities, a piecewise deterministic Markov decision process. For the case of switching arms, only one of which creates rewards, we solve the average optimality equation explicitly and prove that a myopic policy is average optimal. Supported by NSF grant DMS-9404177.

3.
Gittins has shown that for a class of Markov decision processes called alternative bandit processes, optimal policies can easily be determined once the dynamic allocation indices (DAIs) for the constituent bandit processes are computed. Improved algorithms are presented for calculating DAIs both for general bandit processes and for the well-known special case of the multi-armed bandit problem.
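The improved DAI algorithms of this abstract are not reproduced here, but a standard way to compute a Gittins index (DAI) for a finite-state bandit process is the restart-in-state formulation of Katehakis and Veinott: solve the auxiliary MDP in which each step either continues from the current state or restarts from the target state i, and take (1 - beta) times the value at i. A minimal sketch under that formulation (function name and tolerances are my own):

```python
def gittins_index(P, r, beta, i, tol=1e-10, max_iter=100000):
    """Gittins index of state i for a bandit process with transition
    matrix P (list of rows), reward vector r, and discount beta, via the
    restart-in-i value iteration: each step either continues from the
    current state or restarts from i; the index is (1 - beta) * V(i)."""
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iter):
        # value of abandoning the current state and restarting from i
        restart = r[i] + beta * sum(P[i][k] * V[k] for k in range(n))
        V_new = [max(r[j] + beta * sum(P[j][k] * V[k] for k in range(n)),
                     restart)
                 for j in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[i]
```

For the deterministic chain P = [[0, 1], [0, 1]], r = [0, 1] with beta = 0.9, state 0 pays 0 once and then 1 forever, so its index is 0.9, while the absorbing state 1 has index 1.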

4.
Narendra-Shapiro (NS) algorithms are bandit-type algorithms developed in the 1960s. NS algorithms have been studied in depth in the infinite-horizon setting, but few non-asymptotic results exist for this type of bandit algorithm. In this paper, we focus on a non-asymptotic study of the regret and address the following question: are Narendra-Shapiro bandit algorithms competitive from this point of view? In our main result, we obtain uniform explicit bounds for the regret of (over-)penalized NS algorithms. We also extend to the multi-armed case some convergence properties of penalized NS algorithms towards a stationary piecewise deterministic Markov process (PDMP). Finally, we establish new sharp mixing bounds for these processes.
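The (over-)penalized variant analyzed in the paper is not spelled out in this abstract; the classic linear reward-inaction scheme below is the simplest member of the Narendra-Shapiro family and shows the basic probability update (the function name, step size, and two-arm restriction are illustrative choices, not the paper's):

```python
import random

def ns_reward_inaction(success_prob, gamma=0.05, steps=2000, seed=0):
    """Linear reward-inaction scheme for two Bernoulli arms: on a
    success of the chosen arm, move probability mass toward it by a
    factor gamma; on a failure, leave the probabilities unchanged."""
    rng = random.Random(seed)
    p = 0.5  # current probability of playing arm 0
    for _ in range(steps):
        arm = 0 if rng.random() < p else 1
        success = rng.random() < success_prob[arm]
        if success:
            if arm == 0:
                p += gamma * (1.0 - p)   # reinforce arm 0
            else:
                p -= gamma * p           # reinforce arm 1
    return p
```

With a perfectly rewarding arm 0 and a never-rewarding arm 1, p drifts monotonically toward 1; the regret analysis in the paper quantifies how fast such schemes concentrate on the better arm.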

5.
Bandit products have captured significant market shares in China and have started to expand throughout the world. A striking feature of supply chains for bandit products is decentralization, where the upstream firm determines the product quality and the downstream firms compete on prices. We study the competition between a centralized mainstream firm and a decentralized bandit supply chain. We demonstrate that the structural difference between the mainstream firm and the bandit supply chain reduces competition intensity and the quality difference between their products. Surprisingly, the inherent inefficiency in a bandit supply chain, combined with the force of competition, actually leads to both higher product quality and higher price. Furthermore, due to the free-riding effect, the bandit supply chain may even offer higher quality products than the mainstream firm. The mainstream firm’s profit as a function of the free-riding effect is U-shaped, so that free-riding by the bandit supply chain may eventually benefit the mainstream firm. Finally, decentralization benefits the bandit supply chain when the competition is on product features.

6.
One-armed bandit models with continuous and delayed responses
One-armed bandit processes with continuous delayed responses are formulated as controlled stochastic processes following the Bayesian approach. It is shown that under some regularity conditions, a Gittins-like index exists which is the limit of a monotonic sequence of break-even values characterizing optimal initial selections of arms for finite horizon bandit processes. Furthermore, there is an optimal stopping solution when all observations on the unknown arm are complete. Results are illustrated with a bandit model having exponentially distributed responses, in which case the controlled stochastic process becomes a Markov decision process, the Gittins-like index is the Gittins index, and the Gittins index strategy is optimal. Acknowledgement. We thank an anonymous referee for constructive and insightful comments, especially those related to the notion of the Gittins index. Both authors are funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

7.
A two-armed bandit model using a Bayesian approach is formulated and investigated in this paper with the goal of maximizing the value of a certain criterion of optimality. The bandit model illustrates the trade-off between exploration and exploitation, where exploration means acquiring scientific knowledge for better-informed decisions at later stages (i.e., maximizing long-term benefit), and exploitation means applying the current knowledge for the best possible outcome at the current stage (i.e., maximizing the immediate expected payoff). When one arm has known characteristics, stochastic dynamic programming is applied to characterize the optimal strategy and provide the foundation for its calculation. The results show that the celebrated Gittins index can be approximated by a monotonic sequence of break-even values. When both arms are unknown, we derive a special case of optimality of the myopic strategy.
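As an illustration of the stochastic dynamic programming step when one arm is known, here is a minimal Bernoulli sketch: the unknown arm carries a Beta(a, b) posterior on its success probability, the known arm pays lam per pull, and the value function is computed by backward induction on the posterior state (the formulation and names are my own, not the paper's):

```python
from functools import lru_cache

def one_armed_value(a, b, lam, horizon):
    """Finite-horizon value of a Bernoulli one-armed bandit with a
    Beta(a, b) posterior on the unknown arm and a known arm paying lam
    per pull, computed by backward induction on (posterior, horizon)."""
    @lru_cache(maxsize=None)
    def V(a_, b_, h):
        if h == 0:
            return 0.0
        # classical fact: once the known arm is chosen it stays optimal
        # (nothing is learned from it), so its continuation value is lam * h
        known = lam * h
        p = a_ / (a_ + b_)  # posterior mean of the unknown arm
        unknown = (p * (1.0 + V(a_ + 1, b_, h - 1))
                   + (1.0 - p) * V(a_, b_ + 1, h - 1))
        return max(known, unknown)
    return V(a, b, horizon)
```

Scanning lam for the point at which the two actions are tied gives the break-even values whose monotone limit approximates the Gittins index, as described in the abstract.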

8.
Optimization, 2012, 61(3): 257-265
In this note we present a method to modify allocation rules for multi-armed bandit problems, useful also in the finite-horizon case. The method is based on monotonicity properties of the value function and the structure of certain optimal policies. It is applied to the well-known allocation rule of Bather (1985). The resulting improvement is investigated in a numerical case study.

9.
A bandit problem with side observations is an extension of the traditional two-armed bandit problem, in which the decision maker has access to side information before deciding which arm to pull. In this paper, essential properties of the side observations that allow achievability results with respect to optimal regret are extracted and formalized. The sufficient conditions for good side information obtained here admit various types of random processes as special cases, including i.i.d. sequences, Markov chains, deterministic periodic sequences, etc. A simple necessary condition for optimal regret is given, providing further insight into the nature of bandit problems with side observations. A game-theoretic approach simplifies the analysis and justifies the viewpoint that the side observation serves as an index specifying different sub-bandit machines.

10.
Obsolescence of embedded parts is a serious concern for managers of complex systems where the design life of the system typically exceeds 20 years. Capital asset management teams have been exploring several strategies to mitigate risks associated with Diminishing Manufacturing Sources (DMS) and repeated life extensions of complex systems. Asset management cost and the performance of a system depend heavily on the obsolescence mitigation strategy chosen by the decision maker. We have developed mathematical models that can be used to calculate the impact of various obsolescence mitigation strategies on the Total Cost of Ownership (TCO) of a system. We have used classical multi-armed bandit (MAB) and restless bandit models to identify the best strategy for managing obsolescence in such instances wherein organizations have to deal with continuous technological evolution under uncertainty. The results of dynamic programming and a greedy heuristic are compared with the Gittins index solution.

11.
We survey a new approach that the author and his co-workers have developed to formulate stochastic control problems (predominantly queueing systems) as mathematical programming problems. The central idea is to characterize the region of achievable performance in a stochastic control problem, i.e., find linear or nonlinear constraints on the performance vectors that all policies satisfy. We present linear and nonlinear relaxations of the performance space for the following problems: indexable systems (multiclass single-station queues and multi-armed bandit problems), restless bandit problems, polling systems, multiclass queueing and loss networks. These relaxations lead to bounds on the performance of an optimal policy. Using information from the relaxations we construct heuristic nearly optimal policies. The theme in the paper is the thesis that better formulations lead to deeper understanding and better solution methods. Overall, the proposed approach for stochastic control problems parallels efforts of the mathematical programming community in the last twenty years to develop sharper formulations (polyhedral combinatorics and, more recently, nonlinear relaxations) and leads to new insights ranging from a complete characterization and new algorithms for indexable systems to tight lower bounds and nearly optimal algorithms for restless bandit problems, polling systems, multiclass queueing and loss networks.

12.
Bandits are a finite collection of random variables. Bandit problems are Markov decision problems in which, at each decision time, the decision maker selects a random variable (referred to as a bandit arm) and observes an outcome. The selection is based on the observation history. The objective is to sequentially choose arms so as to minimize the growth rate (with decision time) of the number of suboptimal selections. The appellation bandit refers to mechanical gambling machines, and the tradition stems from the question of allocating competing treatments to a sequence of patients having the same disease. Our motivation is machine learning, in which a game-playing or assembly-line-adjusting computer is faced with a sequence of statistically similar decision problems and, as a resource, has access to an expanding data base relevant to these problems. The setting for the present study is nonparametric and infinite horizon. The central aim is to present a methodology that postulates finite moments or, alternatively, bounded bandit arms. Under these circumstances, the proposed strategies are shown to be asymptotically optimal and to converge at guaranteed rates. In the bounded-arm case, the rate is optimal. We extend the theory to the case in which the bandit population is infinite, and share some computational experience.
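The abstract does not specify its strategies; the UCB1 rule of Auer, Cesa-Bianchi, and Fischer is a standard example of an asymptotically optimal index strategy for bounded arms in this nonparametric setting, and a sketch of it conveys the idea (the `pull` callback interface and names are my own):

```python
import math
import random

def ucb1(pull, n_arms, horizon, seed=0):
    """UCB1 for rewards in [0, 1]: play the arm maximizing the empirical
    mean plus an exploration bonus sqrt(2 ln t / n_i).  Returns the pull
    counts; suboptimal arms are chosen only O(log horizon) times."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for arm in range(n_arms):          # initialization: pull each arm once
        sums[arm] += pull(arm, rng)
        counts[arm] += 1
    for t in range(n_arms, horizon):
        arm = max(range(n_arms),
                  key=lambda a: sums[a] / counts[a]
                  + math.sqrt(2.0 * math.log(t + 1) / counts[a]))
        sums[arm] += pull(arm, rng)
        counts[arm] += 1
    return counts
```

On two deterministic arms paying 0.9 and 0.1, the inferior arm is pulled only logarithmically often, which is the kind of guaranteed-rate behavior the abstract discusses.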

13.
We provide a tight bound on the amount of experimentation under the optimal strategy in sequential decision problems. We show the applicability of the result by providing a bound on the cut-off in a one-armed bandit problem.

14.
Multi-armed bandit problem revisited
In this paper, we revisit aspects of the multi-armed bandit problem in the earlier work (Ref. 1). An alternative proof of the optimality of the Gittins index rule is derived under the discounted reward criterion. The proof does not involve an explicit use of the interchange argument. The ideas of the proof are extended to derive the asymptotic optimality of the index rule under the average reward criterion. Problems involving superprocesses and arm-acquiring bandits are also reexamined. The properties of an optimal policy for an arm-acquiring bandit are discussed. This research was supported by NSF Grant IRI-91-20074.

15.
王熙逵, 《经济数学》, 2001, 18(4): 39-48
This paper has two aims. First, to give a systematic introduction to the main concepts and results of the theory of bandit processes. Second, to survey recent developments in models, computation, and applications of bandit processes. The paper characterizes the relationship between bandit processes and Markov decision programming. By considering theoretical and methodological limitations, practical and computational difficulties, and restrictions arising in applications, we discuss several important issues and open problems.

16.
Presented in this paper is a self-contained analysis of a Markov decision problem that is known as the multi-armed bandit. The analysis covers the cases of linear and exponential utility functions. The optimal policy is shown to have a simple and easily-implemented form. Procedures for computing such a policy are presented, as are procedures for computing the expected utility that it earns, given any starting state. For the case of linear utility, constraints that link the bandits are introduced, and the constrained optimization problem is solved via column generation. The methodology is novel in several respects, which include the use of elementary row operations to simplify arguments.

17.
Explicit formulae are obtained for the value and a stationary optimal policy in some cases of the continuous-time two-armed bandit problem with expected discounted reward.

18.
We evaluate the asymptotic performance of boundedly-rational strategies in multi-armed bandit problems, where performance is measured in terms of the tendency (in the limit) to play optimal actions in either (i) isolation or (ii) networks of other learners. We show that, for many strategies commonly employed in economics, psychology, and machine learning, performance in isolation and performance in networks are essentially unrelated. Our results suggest that the performance of various, common boundedly-rational strategies depends crucially upon the social context (if any) in which such strategies are to be employed.

19.
We study four proofs that the Gittins index priority rule is optimal for alternative bandit processes. These include Gittins’ original exchange argument, Weber’s prevailing charge argument, Whittle’s Lagrangian dual approach, and Bertsimas and Niño-Mora’s proof based on the achievable region approach and generalized conservation laws. We extend the achievable region proof to infinite countable state spaces, by using infinite dimensional linear programming theory.

20.