Abstract
Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Yet repeated and prohibitively expensive load reallocation across machines may make these systems ineffective, potentially resulting in data loss or even system failure. To overcome this problem, we propose a comprehensive solution, called the Robust Load Distribution (RLD) strategy, that is resilient under data fluctuations. RLD provides ϵ-optimal query performance under an expected range of load fluctuations without suffering from the performance penalty caused by load migration. RLD is based on three key strategies. First, we model robust distributed stream processing as a parametric query optimization problem in a parameter space that captures the stream fluctuations. The notions of robust logical and robust physical plans, which work together to proactively handle all ranges of expected fluctuations in parameters, are abstracted as overlays of this parameter space. Second, our Early-terminated Robust Partitioning (ERP) finds a combination of robust logical plans that together cover the parameter space, while minimizing the number of prohibitively expensive optimizer calls with a probabilistic bound on the space coverage. Third, we design a family of algorithms for physical plan generation. Our GreedyPhy algorithm exploits a probabilistic model to efficiently find a robust physical plan that sustains the most frequently used robust logical plans at runtime. Our CorPhy algorithm exploits operator correlations for robust physical plan optimization. The resulting physical plan smooths the workload on each node under all expected fluctuations. Our OptPrune algorithm, using CorPhy as a baseline, is guaranteed to find the optimal physical plan that maximizes the parameter space coverage with a practical increase in optimization time.
Lastly, we extend the RLD framework to react appropriately to so-called “space drifts”, that is, changes of the parameter space in which the observed runtime statistics deviate from the expected optimization-time statistics. With this extension, RLD can adjust itself to unexpected yet significant data fluctuations beyond those it planned for by covering the parameter space. Our experimental study using stock market and sensor network streams demonstrates that RLD consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.
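To make the notion of parameter space coverage concrete, the following is a minimal illustrative sketch and not the RLD implementation. It assumes a hypothetical two-stream query whose parameters are the two arrival rates, hypothetical per-operator cost functions, a uniform node capacity, and a sampled grid standing in for the parameter space. The names `node_loads`, `coverage`, `op_cost`, `balanced`, and `skewed` are all invented for illustration. Under these assumptions, a physical plan (an operator-to-node placement) covers a parameter point if no node is overloaded at that point.

```python
from itertools import product


def node_loads(placement, op_cost, rates):
    """Sum the load each node carries at one point of the parameter space."""
    loads = {}
    for op, node in placement.items():
        loads[node] = loads.get(node, 0.0) + op_cost[op](rates)
    return loads


def coverage(placement, op_cost, grid, capacity):
    """Fraction of sampled parameter points at which no node exceeds capacity."""
    points = list(grid)
    feasible = sum(
        all(load <= capacity
            for load in node_loads(placement, op_cost, p).values())
        for p in points
    )
    return feasible / len(points)


# Hypothetical cost model (an assumption, not taken from the paper):
# filters cost one unit per tuple, the join cost grows with both rates.
op_cost = {
    "filter1": lambda r: 1.0 * r[0],
    "filter2": lambda r: 1.0 * r[1],
    "join": lambda r: 0.01 * r[0] * r[1],
}

# Sampled parameter space: both arrival rates range over 10, 20, ..., 100.
grid = list(product(range(10, 101, 10), repeat=2))

# Two candidate physical plans over nodes n1 and n2.
balanced = {"filter1": "n1", "filter2": "n1", "join": "n2"}
skewed = {"filter1": "n1", "filter2": "n1", "join": "n1"}

if __name__ == "__main__":
    for name, plan in (("balanced", balanced), ("skewed", skewed)):
        print(name, coverage(plan, op_cost, grid, capacity=150.0))
```

In this toy setting, the placement that moves the join to its own node stays feasible over a larger share of the sampled points than the placement that puts every operator on one node, which is the sense in which one physical plan is more robust than another. A real optimizer would search over placements rather than compare two fixed ones.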
Index Terms
- Robust Distributed Query Processing for Streaming Data