论文标题
查找$ k $ in $ k $ -means的算法
Algorithms for finding $k$ in $k$-means
论文作者
论文摘要
$ k-$表示聚类需要作为输入$ k $的确切值,即集群数。两个挑战是开放的:(i)$ k $有数据确定的定义是正确的,并且(ii)(ii)是否有多项式时间算法可以从数据中找到$ k $?本文为这两个问题提供了第一个肯定答案。在文献中,我们假设数据允许群集中心分开的未知地面真相(GT)聚类。仅此假设就不足以回答(i)。我们假设一个新颖但自然的第二个约束,称为无紧密的亚群集(NTSC),它规定没有比群集比群集的GT群集的大量子集可以“更紧密”(从某种意义上说明我们定义)。我们对(i)和(ii)的答案是在这两个确定性假设之下。我们还提供多项式时间算法以识别$ K $。我们的算法依赖于NTSC,一次通过识别紧密包装的点来一次剥离一个群集。我们还能够证明我们的算法(S)适用于高斯混合物生成的数据,更普遍地适用于高斯PDF的混合物,因此能够从数据中找到混合物的组件数量。据我们所知,这些专业设置的先前结果也假定除了数据外,还要给出$ K $。
$k-$means Clustering requires as input the exact value of $k$, the number of clusters. Two challenges are open: (i) Is there a data-determined definition of $k$ which is provably correct and (ii) Is there a polynomial time algorithm to find $k$ from data ? This paper provides the first affirmative answers to both these questions. As common in the literature, we assume that the data admits an unknown Ground Truth (GT) clustering with cluster centers separated. This assumption alone is not sufficient to answer Yes to (i). We assume a novel, but natural second constraint called no tight sub-cluster (NTSC) which stipulates that no substantially large subset of a GT cluster can be "tighter" (in a sense we define) than the cluster. Our yes answer to (i) and (ii) are under these two deterministic assumptions. We also give polynomial time algorithm to identify $k$. Our algorithm relies on NTSC to peel off one cluster at a time by identifying points which are tightly packed. We are also able to show that our algorithm(s) apply to data generated by mixtures of Gaussians and more generally to mixtures of sub-Gaussian pdf's and hence are able to find the number of components of the mixture from data. To our knowledge, previous results for these specialized settings as well, assume generally that $k$ is given besides the data.
