Similar Literature: 20 matching articles found
1.
We consider multi-armed bandit problems with switching cost and multiple plays, define …

2.
A symmetric Poissonian two-armed bandit becomes, in terms of a posteriori probabilities, a piecewise deterministic Markov decision process. For the case of switching arms, only one of which creates rewards, we solve the average optimality equation explicitly and prove that a myopic policy is average optimal. Supported by NSF grant DMS-9404177.

3.
Gittins has shown that for a class of Markov decision processes called alternative bandit processes, optimal policies can easily be determined once the dynamic allocation indices (DAIs) for the constituent bandit processes are computed. Improved algorithms are presented for calculating DAIs both for general bandit processes and for the well-known special case of the multi-armed bandit problem.
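The improved DAI algorithms of this abstract are not reproduced here, but a standard way to compute a Gittins index (DAI) for a finite-state bandit process is the restart-in-state formulation of Katehakis and Veinott: solve the auxiliary MDP in which each step either continues from the current state or restarts from the target state i, and take (1 - beta) times the value at i. A minimal sketch under that formulation (function name and tolerances are my own):

```python
def gittins_index(P, r, beta, i, tol=1e-10, max_iter=100000):
    """Gittins index of state i for a bandit process with transition
    matrix P (list of rows), reward vector r, and discount beta, via the
    restart-in-i value iteration: each step either continues from the
    current state or restarts from i; the index is (1 - beta) * V(i)."""
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iter):
        # value of abandoning the current state and restarting from i
        restart = r[i] + beta * sum(P[i][k] * V[k] for k in range(n))
        V_new = [max(r[j] + beta * sum(P[j][k] * V[k] for k in range(n)),
                     restart)
                 for j in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[i]
```

For the deterministic chain P = [[0, 1], [0, 1]], r = [0, 1] with beta = 0.9, state 0 pays 0 once and then 1 forever, so its index is 0.9, while the absorbing state 1 has index 1.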

4.
Narendra-Shapiro (NS) algorithms are bandit-type algorithms developed in the 1960s. NS algorithms have been studied in depth in the infinite-horizon setting, but few non-asymptotic results exist for this type of bandit algorithm. In this paper, we focus on a non-asymptotic study of the regret and address the following question: are Narendra-Shapiro bandit algorithms competitive from this point of view? In our main result, we obtain uniform explicit bounds for the regret of (over-)penalized NS algorithms. We also extend to the multi-armed case some convergence properties of penalized NS algorithms towards a stationary piecewise deterministic Markov process (PDMP). Finally, we establish new sharp mixing bounds for these processes.
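The (over-)penalized variant analyzed in the paper is not spelled out in this abstract; the classic linear reward-inaction scheme below is the simplest member of the Narendra-Shapiro family and shows the basic probability update (the function name, step size, and two-arm restriction are illustrative choices, not the paper's):

```python
import random

def ns_reward_inaction(success_prob, gamma=0.05, steps=2000, seed=0):
    """Linear reward-inaction scheme for two Bernoulli arms: on a
    success of the chosen arm, move probability mass toward it by a
    factor gamma; on a failure, leave the probabilities unchanged."""
    rng = random.Random(seed)
    p = 0.5  # current probability of playing arm 0
    for _ in range(steps):
        arm = 0 if rng.random() < p else 1
        success = rng.random() < success_prob[arm]
        if success:
            if arm == 0:
                p += gamma * (1.0 - p)   # reinforce arm 0
            else:
                p -= gamma * p           # reinforce arm 1
    return p
```

With a perfectly rewarding arm 0 and a never-rewarding arm 1, p drifts monotonically toward 1; the regret analysis in the paper quantifies how fast such schemes concentrate on the better arm.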

5.
Bandit products have captured significant market shares in China and have started to expand throughout the world. A striking feature of supply chains for bandit products is decentralization, where the upstream firm determines the product quality and the downstream firms compete on prices. We study the competition between a centralized mainstream firm and a decentralized bandit supply chain. We demonstrate that the structural difference between the mainstream firm and the bandit supply chain reduces competition intensity and the quality difference between their products. Surprisingly, the inherent inefficiency in a bandit supply chain, combined with the force of competition, actually leads to both higher product quality and higher price. Furthermore, due to the free-riding effect, the bandit supply chain may even offer higher quality products than the mainstream firm. The mainstream firm’s profit as a function of the free-riding effect is U-shaped, so that free-riding by the bandit supply chain may eventually benefit the mainstream firm. Finally, decentralization benefits the bandit supply chain when the competition is on product features.

6.
One-armed bandit models with continuous and delayed responses
One-armed bandit processes with continuous delayed responses are formulated as controlled stochastic processes following the Bayesian approach. It is shown that under some regularity conditions, a Gittins-like index exists which is the limit of a monotonic sequence of break-even values characterizing optimal initial selections of arms for finite horizon bandit processes. Furthermore, there is an optimal stopping solution when all observations on the unknown arm are complete. Results are illustrated with a bandit model having exponentially distributed responses, in which case the controlled stochastic process becomes a Markov decision process, the Gittins-like index is the Gittins index, and the Gittins index strategy is optimal. Acknowledgement. We thank an anonymous referee for constructive and insightful comments, especially those related to the notion of the Gittins index. Both authors are funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

7.
A two-armed bandit model using a Bayesian approach is formulated and investigated in this paper with the goal of maximizing the value of a certain criterion of optimality. The bandit model illustrates the trade-off between exploration and exploitation, where exploration means acquiring scientific knowledge for better-informed decisions at later stages (i.e., maximizing long-term benefit), and exploitation means applying the current knowledge for the best possible outcome at the current stage (i.e., maximizing the immediate expected payoff). When one arm has known characteristics, stochastic dynamic programming is applied to characterize the optimal strategy and provide the foundation for its calculation. The results show that the celebrated Gittins index can be approximated by a monotonic sequence of break-even values. When both arms are unknown, we derive a special case of optimality of the myopic strategy.
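As an illustration of the stochastic dynamic programming step when one arm is known, here is a minimal Bernoulli sketch: the unknown arm carries a Beta(a, b) posterior on its success probability, the known arm pays lam per pull, and the value function is computed by backward induction on the posterior state (the formulation and names are my own, not the paper's):

```python
from functools import lru_cache

def one_armed_value(a, b, lam, horizon):
    """Finite-horizon value of a Bernoulli one-armed bandit with a
    Beta(a, b) posterior on the unknown arm and a known arm paying lam
    per pull, computed by backward induction on (posterior, horizon)."""
    @lru_cache(maxsize=None)
    def V(a_, b_, h):
        if h == 0:
            return 0.0
        # classical fact: once the known arm is chosen it stays optimal
        # (nothing is learned from it), so its continuation value is lam * h
        known = lam * h
        p = a_ / (a_ + b_)  # posterior mean of the unknown arm
        unknown = (p * (1.0 + V(a_ + 1, b_, h - 1))
                   + (1.0 - p) * V(a_, b_ + 1, h - 1))
        return max(known, unknown)
    return V(a, b, horizon)
```

Scanning lam for the point at which the two actions are tied gives the break-even values whose monotone limit approximates the Gittins index, as described in the abstract.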

8.
Optimization, 2012, 61(3): 257-265
In this note we present a method to modify allocation rules for multi-armed bandit problems, useful also in the finite-horizon case. The method is based on monotonicity properties of the value function and the structure of certain optimal policies. It is applied to the well-known allocation rule of Bather (1985). The resulting improvement is investigated in a numerical case study.

9.
A bandit problem with side observations is an extension of the traditional two-armed bandit problem, in which the decision maker has access to side information before deciding which arm to pull. In this paper, essential properties of the side observations that allow achievability results with respect to optimal regret are extracted and formalized. The sufficient conditions for good side information obtained here admit various types of random processes as special cases, including i.i.d. sequences, Markov chains, deterministic periodic sequences, etc. A simple necessary condition for optimal regret is given, providing further insight into the nature of bandit problems with side observations. A game-theoretic approach simplifies the analysis and justifies the viewpoint that the side observation serves as an index specifying different sub-bandit machines.

10.
Obsolescence of embedded parts is a serious concern for managers of complex systems where the design life of the system typically exceeds 20 years. Capital asset management teams have been exploring several strategies to mitigate risks associated with Diminishing Manufacturing Sources (DMS) and repeated life extensions of complex systems. Asset management cost and the performance of a system depend heavily on the obsolescence mitigation strategy chosen by the decision maker. We have developed mathematical models that can be used to calculate the impact of various obsolescence mitigation strategies on the Total Cost of Ownership (TCO) of a system. We have used classical multi-armed bandit (MAB) and restless bandit models to identify the best strategy for managing obsolescence in such instances wherein organizations have to deal with continuous technological evolution under uncertainty. The results of dynamic programming and a greedy heuristic are compared with the Gittins index solution.

11.
We survey a new approach that the author and his co-workers have developed to formulate stochastic control problems (predominantly queueing systems) as mathematical programming problems. The central idea is to characterize the region of achievable performance in a stochastic control problem, i.e., find linear or nonlinear constraints on the performance vectors that all policies satisfy. We present linear and nonlinear relaxations of the performance space for the following problems: indexable systems (multiclass single-station queues and multi-armed bandit problems), restless bandit problems, polling systems, multiclass queueing and loss networks. These relaxations lead to bounds on the performance of an optimal policy. Using information from the relaxations we construct heuristic nearly optimal policies. The theme in the paper is the thesis that better formulations lead to deeper understanding and better solution methods. Overall, the proposed approach for stochastic control problems parallels efforts of the mathematical programming community in the last twenty years to develop sharper formulations (polyhedral combinatorics and, more recently, nonlinear relaxations) and leads to new insights ranging from a complete characterization and new algorithms for indexable systems to tight lower bounds and nearly optimal algorithms for restless bandit problems, polling systems, multiclass queueing and loss networks.

12.
Bandits are a finite collection of random variables. Bandit problems are Markov decision problems in which, at each decision time, the decision maker selects a random variable (referred to as a bandit arm) and observes an outcome. The selection is based on the observation history. The objective is to sequentially choose arms so as to minimize the growth rate (with decision time) of the number of suboptimal selections. The appellation bandit refers to mechanical gambling machines, and the tradition stems from the question of allocating competing treatments to a sequence of patients having the same disease. Our motivation is machine learning, in which a game-playing or assembly-line-adjusting computer is faced with a sequence of statistically similar decision problems and, as a resource, has access to an expanding data base relevant to these problems. The setting for the present study is nonparametric and infinite horizon. The central aim is to present a methodology that postulates finite moments or, alternatively, bounded bandit arms. Under these circumstances, the proposed strategies are shown to be asymptotically optimal and to converge at guaranteed rates. In the bounded-arm case, the rate is optimal. We extend the theory to the case in which the bandit population is infinite, and share some computational experience.
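The abstract does not specify its strategies; the UCB1 rule of Auer, Cesa-Bianchi, and Fischer is a standard example of an asymptotically optimal index strategy for bounded arms in this nonparametric setting, and a sketch of it conveys the idea (the `pull` callback interface and names are my own):

```python
import math
import random

def ucb1(pull, n_arms, horizon, seed=0):
    """UCB1 for rewards in [0, 1]: play the arm maximizing the empirical
    mean plus an exploration bonus sqrt(2 ln t / n_i).  Returns the pull
    counts; suboptimal arms are chosen only O(log horizon) times."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for arm in range(n_arms):          # initialization: pull each arm once
        sums[arm] += pull(arm, rng)
        counts[arm] += 1
    for t in range(n_arms, horizon):
        arm = max(range(n_arms),
                  key=lambda a: sums[a] / counts[a]
                  + math.sqrt(2.0 * math.log(t + 1) / counts[a]))
        sums[arm] += pull(arm, rng)
        counts[arm] += 1
    return counts
```

On two deterministic arms paying 0.9 and 0.1, the inferior arm is pulled only logarithmically often, which is the kind of guaranteed-rate behavior the abstract discusses.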

13.
We provide a tight bound on the amount of experimentation under the optimal strategy in sequential decision problems. We show the applicability of the result by providing a bound on the cut-off in a one-armed bandit problem.

14.
Multi-armed bandit problem revisited
In this paper, we revisit aspects of the multi-armed bandit problem in the earlier work (Ref. 1). An alternative proof of the optimality of the Gittins index rule is derived under the discounted reward criterion. The proof does not involve an explicit use of the interchange argument. The ideas of the proof are extended to derive the asymptotic optimality of the index rule under the average reward criterion. Problems involving superprocesses and arm-acquiring bandits are also reexamined. The properties of an optimal policy for an arm-acquiring bandit are discussed. This research was supported by NSF Grant IRI-91-20074.

15.
王熙逵, 《经济数学》, 2001, 18(4): 39-48
This paper has two aims. First, to give a systematic introduction to the main concepts and results of the theory of bandit processes. Second, to survey recent developments in models, computation, and applications of bandit processes. The paper characterizes the relationship between bandit processes and Markov decision programming. By considering theoretical and methodological limitations, practical and computational difficulties, and restrictions arising in applications, we discuss several important issues and open problems.

16.
Presented in this paper is a self-contained analysis of a Markov decision problem that is known as the multi-armed bandit. The analysis covers the cases of linear and exponential utility functions. The optimal policy is shown to have a simple and easily-implemented form. Procedures for computing such a policy are presented, as are procedures for computing the expected utility that it earns, given any starting state. For the case of linear utility, constraints that link the bandits are introduced, and the constrained optimization problem is solved via column generation. The methodology is novel in several respects, which include the use of elementary row operations to simplify arguments.

17.
Explicit formulae are obtained for the value and a stationary optimal policy in some cases of the continuous-time two-armed bandit problem with expected discounted reward.

18.
We evaluate the asymptotic performance of boundedly-rational strategies in multi-armed bandit problems, where performance is measured in terms of the tendency (in the limit) to play optimal actions in either (i) isolation or (ii) networks of other learners. We show that, for many strategies commonly employed in economics, psychology, and machine learning, performance in isolation and performance in networks are essentially unrelated. Our results suggest that the performance of various, common boundedly-rational strategies depends crucially upon the social context (if any) in which such strategies are to be employed.

19.
We study four proofs that the Gittins index priority rule is optimal for alternative bandit processes. These include Gittins’ original exchange argument, Weber’s prevailing charge argument, Whittle’s Lagrangian dual approach, and Bertsimas and Niño-Mora’s proof based on the achievable region approach and generalized conservation laws. We extend the achievable region proof to infinite countable state spaces, by using infinite dimensional linear programming theory.

20.