Fast Approximate Correlation for Massive Time-series Data

Abdullah Mueen; Suman Nath; Jie Liu

Fast Approximate Correlation for Massive Time-series Data

Abdullah Mueen ,
Suman Nath ,
Jie Liu

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD international conference on Management of data | January 2010

Published by Association for Computing Machinery, Inc.

Publication

Download BibTex

We consider the problem of computing all-pair correlations in a warehouse containing a large number (e.g., tens of thousands) of time-series (or, signals). The problem arises in automatic discovery of patterns and anomalies in data intensive applications such as data center management, environmental monitoring, and scientiﬁc experiments. However, with existing techniques, solving the problem for a large stream warehouse is extremely expensive, due to the problem’s inherent quadratic I/O and CPU complexities. We propose novel algorithms, based on Discrete Fourier Transformation (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query. To minimize I/O cost, we partition a massive set of input signals into smaller batches such that caching the signals one batch at a time maximizes data reuse and minimizes disk I/O. To reduce CPU cost, we propose two approximation algorithms. Our ﬁrst algorithm efﬁciently computes approximate correlation coefﬁcients of similar signal pairs within a given error bound. The second algorithm efﬁciently identiﬁes, without any false positives or negatives, all signal pairs with correlations above a given threshold. For many real applications, our approximate solutions are as useful as corresponding exact solutions, due to our strict error guarantees. However, compared to the state-of-the-art exact algorithms, our algorithms are up to 17× faster for several real datasets

Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org. The definitive version of this paper can be found at ACM's Digital Library --http://www.acm.org/dl/.