论文标题
在未知的多面体上安全的线性匪徒
Safe Linear Bandits over Unknown Polytopes
论文作者
论文摘要
安全的线性匪徒问题(SLB)是一种在线编程的方法,具有未知的目标和未知的圆周约束,在奖励和安全行动安全风险的随机匪徒反馈下。我们研究了SLB的功效和平稳的安全成本与多面化的平滑安全成本之间的权衡,以及在避免现存的令人讨厌的令人讨厌的方法做出的强烈假设方面,具有侵略性的双重优势玩法的作用。 我们首先阐明了SLB中固有的硬度,这是由于缺乏约束知识:存在“简单”实例,次优的极端点具有较大的“差距”,但是SLB方法仍然必须遇到$ω(\ sqrt {t})(t})$遗憾或安全违规行为,因为无效地解决了仲裁精确的精确性。 We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, DOSS, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, DOSS simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\tilde O(\sqrt{T})$ bounds关于安全违规。此外,当要求安全到有限的精度时,违规行为提高到$ o(\ log^2 t)。$这些结果取决于对线性匪徒的新颖双重分析:我们认为,\ algoname会通过激活至少$ d $约束的噪声来进行,从而使我们分别分析了一个良好的结构,这是一个良好的结构,并且在某些情况下进行了分析。激活。基于线性程序的全局灵敏度分析,通过开发新的双重差距来控制前者的成本,以$ o(\ log^2 t)$控制,从而量化了每组此类约束集的次优。通过明确分析乐观游戏的解决方案,后者的成本可控制至$ o(1)$。
The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive doubly-optimistic play in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist `easy' instances, for which suboptimal extreme points have large `gaps', but on which SLB methods must still incur $Ω(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, DOSS, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, DOSS simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\tilde O(\sqrt{T})$ bounds on safety violations. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \algoname proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a `poor' set of constraints is activated, and rounds where `good' sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.
