Paper title
Extrapolation in Gridworld Markov-Decision Processes
Paper authors
Paper abstract
Extrapolation in reinforcement learning is the ability to generalize at test time to states that could never have occurred at training time. Here we consider four factors that lead to improved extrapolation in a simple Gridworld environment: (a) avoiding maximum Q-value (or other deterministic methods) for action choice at test time, (b) ego-centric representation of the Gridworld, (c) building rotational and mirror symmetry into the learning mechanism using rotation- and mirror-invariant convolution (rather than standard translation-invariant convolution), and (d) adding a maximum entropy term to the loss function to encourage equally good actions to be chosen equally often.
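As a rough illustration of factors (a), (b), and (d), the following Python sketch contrasts deterministic and stochastic action choice, crops an agent-centred view of a grid, and adds an entropy bonus to a loss. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say which stochastic rule is used, so softmax (Boltzmann) sampling is assumed here, and all function names, the `temperature` and `beta` parameters, and the padding value are hypothetical.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Turn Q-values into action probabilities (assumed Boltzmann rule)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(q_values):
    """Deterministic choice: ties always resolve to the lowest index."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sample_action(q_values, temperature=1.0, rng=random):
    """Stochastic choice: equally valued actions are equally likely."""
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def entropy(probs):
    """Shannon entropy (in nats) of an action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_loss(base_loss, probs, beta=0.01):
    """Subtract an entropy bonus so higher-entropy policies cost less.

    `beta` is a hypothetical weighting coefficient.
    """
    return base_loss - beta * entropy(probs)


def egocentric_view(grid, agent_row, agent_col, radius, pad=1):
    """Crop a (2*radius+1)-wide window centred on the agent.

    Cells outside the grid are filled with `pad` (assumed to mean wall).
    """
    rows, cols = len(grid), len(grid[0])
    view = []
    for r in range(agent_row - radius, agent_row + radius + 1):
        row = []
        for c in range(agent_col - radius, agent_col + radius + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(grid[r][c] if inside else pad)
        view.append(row)
    return view
```

With two tied best actions, `greedy_action` returns the same index every time, whereas `sample_action` spreads its choices across both. This is the behaviour that factor (a) and the entropy term in factor (d) are described as encouraging.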
