Robust Distributed Query Processing for Streaming Data

Published: 26 May 2014

Abstract

Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Yet repeated and prohibitively expensive load reallocation across machines may make these systems ineffective, potentially resulting in data loss or even system failure. To overcome this problem, we propose a comprehensive solution, called the Robust Load Distribution (RLD) strategy, that is resilient under data fluctuations. RLD provides ϵ-optimal query performance under an expected range of load fluctuations without suffering from the performance penalty caused by load migration. RLD is based on three key strategies. First, we model robust distributed stream processing as a parametric query optimization problem in a parameter space that captures the stream fluctuations. The notions of both robust logical and robust physical plans, which work together to proactively handle all ranges of expected fluctuations in parameters, are abstracted as overlays of this parameter space. Second, our Early-terminated Robust Partitioning (ERP) finds a combination of robust logical plans that together cover the parameter space, while minimizing the number of prohibitively expensive optimizer calls with a probabilistic bound on the space coverage. Third, we design a family of algorithms for physical plan generation. Our GreedyPhy algorithm exploits a probabilistic model to efficiently find a robust physical plan that sustains the most frequently used robust logical plans at runtime. Our CorPhy algorithm exploits operator correlations for robust physical plan optimization. The resulting physical plan smooths the workload on each node under all expected fluctuations. Our OptPrune algorithm, using CorPhy as a baseline, is guaranteed to find the optimal physical plan that maximizes the parameter space coverage with a practical increase in optimization time.
Lastly, we extend the RLD framework to react appropriately to so-called “space drifts”: changes of the parameter space in which the observed runtime statistics deviate from the expected optimization-time statistics. RLD can thus adjust itself to unexpected yet significant data fluctuations beyond those planned for through coverage of the parameter space. Our experimental study using stock market and sensor network streams demonstrates that RLD consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.
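
To make the notion of a space drift concrete, the following minimal sketch assumes that the expected parameter space can be approximated by one interval per monitored statistic, and flags a drift whenever an observed runtime value falls outside its interval. The class `ParameterRange`, the function `drifted_parameters`, the statistic names, and all numeric values are hypothetical illustrations; this is not the detection procedure used by RLD, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterRange:
    """Expected [low, high] interval for one stream statistic (hypothetical)."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        # A value is "expected" if it lies inside the closed interval.
        return self.low <= value <= self.high


def drifted_parameters(expected, observed):
    """Return the sorted names of statistics observed outside their expected range.

    Statistics with no expected range are ignored in this sketch.
    """
    return sorted(
        name
        for name, value in observed.items()
        if name in expected and not expected[name].contains(value)
    )


# Assumed optimization-time estimates (illustrative values only).
expected_space = {
    "arrival_rate": ParameterRange(100.0, 500.0),
    "join_selectivity": ParameterRange(0.01, 0.10),
}

# Assumed runtime observations: the arrival rate exceeds its expected range.
observed = {"arrival_rate": 620.0, "join_selectivity": 0.05}

print(drifted_parameters(expected_space, observed))
```

A non-empty result would signal that the observed statistics have left the space the robust plans were built to cover, which is the situation the abstract calls a space drift.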

    • Published in

      ACM Transactions on Database Systems  Volume 39, Issue 2
      May 2014
      336 pages
      ISSN: 0362-5915
      EISSN: 1557-4644
      DOI: 10.1145/2627748

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 May 2014
      • Accepted: 1 March 2014
      • Revised: 1 February 2014
      • Received: 1 April 2013

      Qualifiers

      • research-article
      • Research
      • Refereed