research-article

Mining order-preserving submatrices from probabilistic matrices

Published: 06 January 2014

Abstract

Order-preserving submatrices (OPSMs) capture consensus trends over columns shared by rows in a data matrix. Mining OPSM patterns discovers important and interesting local correlations in many real applications, such as those involving biological data or sensor data. The prevalence of uncertain data in various applications, however, poses new challenges for OPSM mining, since data uncertainty must be incorporated into both the OPSM model and the mining algorithms.
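As a small illustration of the deterministic OPSM notion described above, the following sketch checks which rows of an ordinary (non-probabilistic) matrix follow a given column ordering. The function names `supports_order` and `supporting_rows` and the toy matrix are hypothetical and chosen only for this example; the strict-increase reading of "consensus trend" is the usual OPSM definition, not code from the article.

```python
from typing import List, Sequence


def supports_order(row: Sequence[float], order: Sequence[int]) -> bool:
    """Return True if the row's values strictly increase along `order`.

    `order` is a sequence of column indices; the row supports the pattern
    when row[order[0]] < row[order[1]] < ... holds.
    """
    return all(row[a] < row[b] for a, b in zip(order, order[1:]))


def supporting_rows(matrix: Sequence[Sequence[float]],
                    order: Sequence[int]) -> List[int]:
    """Indices of the rows that follow the given column ordering."""
    return [i for i, row in enumerate(matrix) if supports_order(row, order)]


# Toy data (made up for illustration): 3 rows, 4 columns.
matrix = [
    [0.2, 0.9, 0.5, 0.7],
    [1.1, 3.0, 1.8, 0.4],
    [0.6, 0.1, 0.3, 0.9],
]
order = (0, 2, 1)  # pattern: column 0 < column 2 < column 1

print(supporting_rows(matrix, order))  # -> [0, 1]
```

Rows 0 and 1 have very different magnitudes but share the same trend over the three columns, which is what makes them part of one order-preserving submatrix.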

In this article, we define new probabilistic matrix representations to model uncertain data with continuous distributions. We formalize a novel probabilistic order-preserving submatrix (POPSM) model to capture similar local correlations in probabilistic matrices. The POPSM model adopts a new probabilistic support measure that evaluates the extent to which a row belongs to a POPSM pattern. Because the POPSM mining problem has intrinsically high computational complexity, we exploit the anti-monotonic property of the probabilistic support measure and propose an efficient Apriori-based mining framework, ProbApri, to mine POPSM patterns. The framework consists of two mining methods, UniApri and NormApri, developed for two representative types of probabilistic matrices: UniApri mines POPSM patterns from the UniDist matrix (assuming uniform data distributions), and NormApri mines them from the NormDist matrix (assuming normal data distributions). We show that the NormApri method is practical enough for mining POPSM patterns from probabilistic matrices that model more general data distributions.
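The abstract does not give the formula for the probabilistic support measure, so the following is only a sketch of one plausible reading: for a row whose entries are independent uniform intervals (in the spirit of a UniDist matrix), estimate by Monte Carlo sampling the probability that the row follows a column ordering. The function `order_probability`, the independence assumption, and the toy intervals are all assumptions made for illustration; the article's UniApri method may compute this quantity differently.

```python
import random
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (low, high) of a uniform distribution


def order_probability(row: Sequence[Interval],
                      order: Sequence[int],
                      trials: int = 20000,
                      seed: int = 0) -> float:
    """Estimate P(values increase strictly along `order`) for one row.

    Assumption: each entry is an independent uniform random variable.
    Every trial samples the whole row, so two calls with the same seed
    see the same samples regardless of the pattern being tested.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.uniform(lo, hi) for lo, hi in row]
        if all(sample[a] < sample[b] for a, b in zip(order, order[1:])):
            hits += 1
    return hits / trials


# Toy row (made up): columns 0 and 1 overlap, column 2 lies above both.
row = [(0.0, 1.0), (0.5, 1.5), (2.0, 3.0)]

p_short = order_probability(row, (0, 1))     # exact value is 0.875
p_long = order_probability(row, (0, 1, 2))   # can never exceed p_short

print(p_short, p_long)
```

Extending a pattern can only add constraints, so the probability for `(0, 1, 2)` is never larger than for its prefix `(0, 1)`. This is the kind of anti-monotonic behaviour that lets an Apriori-style search prune longer candidate patterns once a shorter one falls below a support threshold.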

We demonstrate the superiority of our approach through two applications. First, we use two biological datasets to illustrate that the POPSM model better captures the characteristics of the expression levels of biologically correlated genes and greatly promotes the discovery of patterns with high biological significance. Our results are significantly better than those of the counterpart OPSMRM (OPSM with repeated measurement) model, which adopts a set-valued matrix representation to capture data uncertainty. Second, we run experiments on an RFID trace dataset and show that our POPSM model is effective and efficient in capturing the common visiting subroutes among users.

    • Published in

      ACM Transactions on Database Systems, Volume 39, Issue 1
      January 2014
      317 pages
      ISSN: 0362-5915
      EISSN: 1557-4644
      DOI: 10.1145/2576988

      Copyright © 2014 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 January 2014
      • Accepted: 1 October 2013
      • Revised: 1 July 2013
      • Received: 1 December 2012

      Qualifiers

      • research-article
      • Research
      • Refereed
