TY - GEN
T1 - Detection of highly correlated live data streams
AU - Alseghayer, Rakan
AU - Petrov, Daniel
AU - Chrysanthis, Panos K.
AU - Sharaf, Mohamed
AU - Labrinidis, Alexandros
N1 - Publisher Copyright: © 2017 ACM.
PY - 2017/8/28
Y1 - 2017/8/28
N2 - More and more organizations (commercial, health, government and security) currently base their decisions on real-time analysis of fast-arriving, large volumes of data streams. For such analysis to lead to actionable information in real time and at the right time, the most recent data needs to be processed within a specified delay target. Effective solutions for analysis of such data streams rely on two techniques: (1) incremental sliding-window computation of aggregates, to avoid unnecessary recomputations, and (2) intelligent scheduling of computational steps and operations. In this paper, we propose a solution that combines both of these techniques to find highly correlated data streams in real time, using the Pearson Correlation Coefficient as a correlation metric for two windows of data streams. Specifically, we propose to partition a set of data streams into micro-batches that capture the delay target, use sliding windows within a range as the subsequences of values exhibiting a certain level of correlation, utilize the idea of sufficient statistics to incrementally compute the Pearson Correlation Coefficient of pairs of sliding windows, and adopt deadline-aware priority scheduling to detect the highly correlated pairs of data streams. Our experimental results show that our scheme, and in particular our Price-DCS with warm start scheduling algorithm, outperforms existing ones and enables a high degree of interactivity in correlating micro-batches of live data streams.
AB - More and more organizations (commercial, health, government and security) currently base their decisions on real-time analysis of fast-arriving, large volumes of data streams. For such analysis to lead to actionable information in real time and at the right time, the most recent data needs to be processed within a specified delay target. Effective solutions for analysis of such data streams rely on two techniques: (1) incremental sliding-window computation of aggregates, to avoid unnecessary recomputations, and (2) intelligent scheduling of computational steps and operations. In this paper, we propose a solution that combines both of these techniques to find highly correlated data streams in real time, using the Pearson Correlation Coefficient as a correlation metric for two windows of data streams. Specifically, we propose to partition a set of data streams into micro-batches that capture the delay target, use sliding windows within a range as the subsequences of values exhibiting a certain level of correlation, utilize the idea of sufficient statistics to incrementally compute the Pearson Correlation Coefficient of pairs of sliding windows, and adopt deadline-aware priority scheduling to detect the highly correlated pairs of data streams. Our experimental results show that our scheme, and in particular our Price-DCS with warm start scheduling algorithm, outperforms existing ones and enables a high degree of interactivity in correlating micro-batches of live data streams.
KW - Correlation
KW - Data exploration
KW - Data streams
KW - Search
KW - Subsequence
UR - http://www.scopus.com/inward/record.url?scp=85030325175&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85030325175&partnerID=8YFLogxK
U2 - 10.1145/3129292.3129298
DO - 10.1145/3129292.3129298
M3 - Conference contribution
AN - SCOPUS:85030325175
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics, BIRTE 2017
PB - Association for Computing Machinery
T2 - 11th International Workshop on Real-Time Business Intelligence and Analytics, BIRTE 2017
Y2 - 28 August 2017
ER -