论文标题
许多处理器,很少的时间:通过最佳传输耦合进行分区的MCMC
Many processors, little time: MCMC for partitions via optimal transport couplings
论文作者
论文摘要
马尔可夫链蒙特卡洛(MCMC)方法通常用于聚类,因为它们保证在无限时间限制中渐近确切的期望。但是,在有限的时间里,缓慢的混合通常会导致性能差。现代计算环境提供了巨大的并行性,但是平行MCMC的幼稚实现会表现出很大的偏见。在连续随机变量的MCMC采样器中,马尔可夫链耦合可以克服偏差。但是这些方法在少量过渡后至关重要地取决于配对的连锁链会议。我们表明,现有耦合想法对离散聚类变量的直接应用无法快速满足。这种故障源于“标签开关问题”:语义上等效的聚类重新标记阻碍了耦合链的快速会议。相反,我们将链条视为探索分区的空间,而不是分区的(任意)标签。使用分区空间上的度量标准,我们使用最佳传输耦合制定了实用算法。我们的理论证实我们的方法是准确有效的。在从基因或种子的聚类到图形色素的实验中,我们在高度平行的,限时的状态下显示了耦合的好处。
Markov chain Monte Carlo (MCMC) methods are often used in clustering since they guarantee asymptotically exact expectations in the infinite-time limit. In finite time, though, slow mixing often leads to poor performance. Modern computing environments offer massive parallelism, but naive implementations of parallel MCMC can exhibit substantial bias. In MCMC samplers of continuous random variables, Markov chain couplings can overcome bias. But these approaches depend crucially on paired chains meetings after a small number of transitions. We show that straightforward applications of existing coupling ideas to discrete clustering variables fail to meet quickly. This failure arises from the "label-switching problem": semantically equivalent cluster relabelings impede fast meeting of coupled chains. We instead consider chains as exploring the space of partitions rather than partitions' (arbitrary) labelings. Using a metric on the partition space, we formulate a practical algorithm using optimal transport couplings. Our theory confirms our method is accurate and efficient. In experiments ranging from clustering of genes or seeds to graph colorings, we show the benefits of our coupling in the highly parallel, time-limited regime.
