Paper Title
Data Series Indexing Gone Parallel
Paper Authors
Paper Abstract
Data series similarity search is a core operation for several data series analysis applications across many different domains. However, state-of-the-art techniques fail to deliver the time performance required for interactive exploration or for the analysis of large data series collections. In this Ph.D. work, we present the first data series indexing solutions, for both on-disk and in-memory data, that are designed to inherently take advantage of multi-core architectures in order to accelerate similarity search processing times. Our experiments on a variety of synthetic and real data demonstrate that our approaches are up to orders of magnitude faster than the alternatives. More specifically, our on-disk solution can answer exact similarity search queries on 100GB datasets in a few seconds, and our in-memory solution can do so in a few milliseconds, which enables real-time, interactive data exploration on very large data series collections.
