Abstract
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations, which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network.
We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
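To make the heavy-hitters claim concrete, the following is a minimal Python sketch of a Misra-Gries (MG) style summary with a merge step. The abstract does not spell out the algorithm, so this shows one commonly described merge rule: add the two counter sets, subtract the (k+1)-th largest combined count from every counter, and drop counters that are no longer positive. The names `mg_update` and `mg_merge` and the parameter `k` (maximum number of counters, on the order of 1/ε) are illustrative assumptions, not identifiers from the article.

```python
from collections import Counter


def mg_update(summary, item, k):
    """Process one stream item, keeping at most k counters (MG-style)."""
    if item in summary or len(summary) < k:
        summary[item] = summary.get(item, 0) + 1
    else:
        # No free counter: decrement every counter and drop those reaching zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]


def mg_merge(a, b, k):
    """Merge two MG-style summaries into one with at most k counters.

    Assumed merge rule: sum the counters, then subtract the (k+1)-th
    largest combined count and keep only strictly positive counters.
    """
    combined = Counter(a)
    combined.update(b)  # Counter.update adds counts for shared keys
    if len(combined) <= k:
        return dict(combined)
    # The (k+1)-th largest count; at most k counters exceed it.
    threshold = sorted(combined.values(), reverse=True)[k]
    return {x: c - threshold for x, c in combined.items() if c > threshold}
```

For instance, merging `{"x": 5, "y": 2}` and `{"x": 3, "z": 4}` with `k = 2` sums to `x: 8, z: 4, y: 2`, subtracts the third-largest count (2), and leaves `{"x": 6, "z": 2}`. Each stored count underestimates the true frequency, and the merged summary has the same size bound as its inputs, which is the kind of size-preserving behavior the abstract describes.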