PrivBayes: Private Data Release via Bayesian Networks

Authors:
Jun Zhang, Google Inc., China
Graham Cormode, University of Warwick, UK
Cecilia M. Procopiuc, Google Inc., USA
Divesh Srivastava, AT&T Labs-Research, USA
Xiaokui Xiao, Nanyang Technological University, Singapore

ACM Transactions on Database Systems, Volume 42, Issue 4, Article No. 25, pp. 1–41. https://doi.org/10.1145/3134428

Published: 27 October 2017

Abstract

Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless.
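The dimensionality problem described above can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only and is not taken from the paper: it assumes binary attributes, records spread evenly over the cells of the full contingency table, and the standard Laplace mechanism applied to every cell with sensitivity 1. The function name `avg_signal_to_noise` is hypothetical.

```python
import math


def avg_signal_to_noise(n_records: int, n_binary_attrs: int, epsilon: float,
                        sensitivity: float = 1.0) -> float:
    """Ratio of the average cell count to the Laplace noise standard deviation.

    Assumes the full contingency table over `n_binary_attrs` binary attributes
    is released with Laplace noise of scale sensitivity / epsilon per cell.
    """
    n_cells = 2 ** n_binary_attrs
    avg_count = n_records / n_cells
    # A Laplace distribution with scale b has standard deviation sqrt(2) * b.
    noise_std = math.sqrt(2.0) * sensitivity / epsilon
    return avg_count / noise_std


# The average signal halves with every extra attribute; the noise does not shrink.
for d in (4, 10, 20, 30):
    print(d, avg_signal_to_noise(1_000_000, d, epsilon=1.0))
```

Under these assumptions, a million records over 4 attributes leave each cell far above the noise level, while at 30 attributes the average cell holds much less than one record and is dominated by noise.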

To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
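The last three steps of the pipeline (noisy marginals, approximate distribution, sampling) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes binary attributes, a network structure that is given in advance rather than learned privately (so the surrogate-function step is omitted), at most one parent per attribute, an even split of the privacy budget across marginals, and count sensitivity 2 under record replacement. All names, including `privbayes_sketch`, are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(rows, attrs, scale, rng):
    """Noisy, normalized joint distribution over the binary attributes in `attrs`."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    noisy = {cell: max(0.0, counts[cell] + laplace(scale, rng))
             for cell in product((0, 1), repeat=len(attrs))}
    total = sum(noisy.values())
    if total == 0.0:  # every cell was clamped to zero: fall back to uniform
        return {cell: 1.0 / len(noisy) for cell in noisy}
    return {cell: v / total for cell, v in noisy.items()}


def privbayes_sketch(rows, parents, epsilon, n_synthetic, seed=0):
    """parents[i] is the parent index of attribute i, or None for a root.

    Every parent must have a smaller index than its child.
    """
    rng = random.Random(seed)
    d = len(parents)
    # Assumption: epsilon is split evenly over d marginals, each with sensitivity 2.
    scale = 2.0 * d / epsilon
    marginals = []
    for child, parent in enumerate(parents):
        attrs = (child,) if parent is None else (parent, child)
        marginals.append(noisy_marginal(rows, attrs, scale, rng))

    synthetic = []
    for _ in range(n_synthetic):
        row = []
        for child, parent in enumerate(parents):
            m = marginals[child]
            if parent is None:
                p0, p1 = m[(0,)], m[(1,)]
            else:
                p0, p1 = m[(row[parent], 0)], m[(row[parent], 1)]
            # P(child = 1 | parent); uniform if the conditioning cells were zeroed.
            p_one = 0.5 if p0 + p1 == 0.0 else p1 / (p0 + p1)
            row.append(1 if rng.random() < p_one else 0)
        synthetic.append(tuple(row))
    return synthetic


# Toy input: a chain A -> B -> C in which each attribute copies its parent 90% of the time.
data_rng = random.Random(1)
data = []
for _ in range(5000):
    a = data_rng.randint(0, 1)
    b = a if data_rng.random() < 0.9 else 1 - a
    c = b if data_rng.random() < 0.9 else 1 - b
    data.append((a, b, c))

synth = privbayes_sketch(data, parents=[None, 0, 1], epsilon=1.0, n_synthetic=5000)
```

Noise is added only to one- and two-way marginals, so its scale grows with the number of marginals rather than with the size of the full domain. Clamping, normalizing, and sampling operate on the noisy values alone, so they consume no additional privacy budget.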

Supplemental Material

zhang.zip (159.3 KB): supplemental movie, appendix, image, and software files for PrivBayes: Private Data Release via Bayesian Networks.

Index Terms

1. Security and privacy
  1. Database and storage security
    1. Data anonymization and sanitization

Published in
ACM Transactions on Database Systems Volume 42, Issue 4
Invited Paper from SIGMOD 2016, Invited Paper from PODS 2016, Invited Paper from ICDT 2016 and Regular Papers
December 2017
241 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3155316
Editor:
Christian S. Jensen
Aalborg University, Denmark
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 October 2017
- Accepted: 1 August 2017
- Revised: 1 March 2017
- Received: 1 July 2016
Author Tags
Differential privacy
Bayesian network
synthetic data generation
Qualifiers
- research-article
- Research
- Refereed