论文标题
近似ULAM指标下的中位数
Approximating the Median under the Ulam Metric
论文作者
论文摘要
我们研究了\ emph {中间字符串}问题的变体的近似算法,该算法要求一个字符串,该字符串最大程度地减少了从给定的$ m $ length $ n $的$ m $ strings的编辑距离之和。只有直接的$ 2 $ -Approximation以此NP问题而闻名。这个问题是由计算生物学激励的,并且属于中值问题(在不同的度量空间上),这是数据分析中的基本任务。 我们的主要结果是用于ULAM指标,其中所有字符串都是$ [n] $的排列,每个编辑操作都会移动一个符号(删除和插入)。我们为这个问题设计了一种算法,它破坏了$ 2 $ - approximation屏障,即计算$(2-δ)$ - 某些常数$δ> 0 $ nime $ \ tilde {o}(o}(nm^2+n^3)$的$(2-Δ)$ - 近似中位数。在特殊情况下,我们进一步使用这些技术来实现$(2-δ)$近似值,即中位数仅限于长度$ n $,最佳目标是$ω(MN)$。 我们还针对ULAM中位数的以下概率模型设计了一种近似算法:输入由(未知)置换$ x $的$ m $扰动组成,每个符号是通过以概率(参数)$ε> 0 $将每个符号移动到随机位置而生成的。我们的算法以高概率为$(1+o(1/ε))$计算 - 时间$ O(Mn^2+n^3)$的近似中位数。
We study approximation algorithms for variants of the \emph{median string} problem, which asks for a string that minimizes the sum of edit distances from a given set of $m$ strings of length $n$. Only the straightforward $2$-approximation is known for this NP-hard problem. This problem is motivated e.g.~by computational biology, and belongs to the class of median problems (over different metric spaces), which are fundamental tasks in data analysis. Our main result is for the Ulam metric, where all strings are permutations over $[n]$ and each edit operation moves a symbol (deletion plus insertion). We devise for this problem an algorithms that breaks the $2$-approximation barrier, i.e., computes a $(2-δ)$-approximate median permutation for some constant $δ>0$ in time $\tilde{O}(nm^2+n^3)$. We further use these techniques to achieve a $(2-δ)$ approximation for the median string problem in the special case where the median is restricted to length $n$ and the optimal objective is large $Ω(mn)$. We also design an approximation algorithm for the following probabilistic model of the Ulam median: the input consists of $m$ perturbations of an (unknown) permutation $x$, each generated by moving every symbol to a random position with probability (a parameter) $ε>0$. Our algorithm computes with high probability a $(1+o(1/ε))$-approximate median permutation in time $O(mn^2+n^3)$.
