Abstract
Order-preserving submatrices (OPSMs) capture consensus trends over columns shared by rows in a data matrix. Mining OPSM patterns discovers important and interesting local correlations in many real applications, such as those involving biological data or sensor data. The prevalence of uncertain data in various applications, however, poses new challenges for OPSM mining, since data uncertainty must be incorporated into both the OPSM model and the mining algorithms.
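To make the notion of a consensus trend concrete, the following minimal sketch checks which rows of a small deterministic matrix follow a given column order. The function names `supports_order` and `supporting_rows` and the toy matrix are illustrative assumptions, not part of the article; the check uses the usual OPSM condition that a row's values strictly increase along the chosen column permutation.

```python
from typing import List, Sequence


def supports_order(row: Sequence[float], column_order: Sequence[int]) -> bool:
    """Return True if the row's values strictly increase along column_order."""
    values = [row[c] for c in column_order]
    return all(a < b for a, b in zip(values, values[1:]))


def supporting_rows(matrix: Sequence[Sequence[float]],
                    column_order: Sequence[int]) -> List[int]:
    """Indices of the rows that follow the given column order."""
    return [i for i, row in enumerate(matrix) if supports_order(row, column_order)]


# Toy data (hypothetical): 3 rows, 4 columns.
matrix = [
    [0.2, 0.9, 0.5, 0.1],
    [1.0, 3.0, 2.0, 0.5],
    [0.7, 0.3, 0.6, 0.9],
]
order = [0, 2, 1]  # candidate trend: column 0 < column 2 < column 1

print(supporting_rows(matrix, order))
```

Rows 0 and 1 have very different magnitudes but share the same trend over columns 0, 2, and 1, which is the kind of local correlation an OPSM pattern is meant to capture.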
In this article, we define new probabilistic matrix representations to model uncertain data with continuous distributions. We formalize a novel probabilistic order-preserving submatrix (POPSM) model to capture similar local correlations in probabilistic matrices. The POPSM model adopts a new probabilistic support measure that evaluates the extent to which a row belongs to a POPSM pattern. Because the POPSM mining problem has intrinsically high computational complexity, we exploit the anti-monotonic property of the probabilistic support measure and propose an efficient Apriori-based mining framework, called ProbApri, to mine POPSM patterns. The framework consists of two mining methods, UniApri and NormApri, which mine POPSM patterns from two representative types of probabilistic matrices, respectively: the UniDist matrix (assuming uniform data distributions) and the NormDist matrix (assuming normal data distributions). We show that the NormApri method is practical enough for mining POPSM patterns from probabilistic matrices that model more general data distributions.
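The abstract does not spell out the probabilistic support measure, so the sketch below is only an illustration of the general idea under stated assumptions: each matrix entry is an independent uniform distribution given as a `(low, high)` interval, and a row's support for a column order is taken to be the probability that its values strictly increase along that order, estimated here by Monte Carlo sampling. The function `order_probability`, its parameters, and the toy row are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def order_probability(intervals: Sequence[Interval],
                      column_order: Sequence[int],
                      n_samples: int = 20000,
                      seed: int = 0) -> float:
    """Monte Carlo estimate of P(values strictly increase along column_order).

    Assumption: entries are independent and uniformly distributed on their
    intervals. Every column is sampled in each trial, so two calls with the
    same seed see identical samples.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample: List[float] = [rng.uniform(lo, hi) for lo, hi in intervals]
        values = [sample[c] for c in column_order]
        if all(a < b for a, b in zip(values, values[1:])):
            hits += 1
    return hits / n_samples


# One uncertain row over three columns (hypothetical intervals).
row = [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]

p_short = order_probability(row, [0, 1])     # column 0 < column 1
p_long = order_probability(row, [0, 1, 2])   # column 0 < column 1 < column 2

print(p_short, p_long)
```

Extending a column order can only shrink the event that the row follows it, so `p_long` cannot exceed `p_short`. This is the kind of anti-monotonic behaviour that lets an Apriori-style search discard every extension of a column order whose support is already below the threshold.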
We demonstrate the superiority of our approach through two applications. First, we use two biological datasets to illustrate that the POPSM model better captures the characteristics of the expression levels of biologically correlated genes and greatly promotes the discovery of patterns with high biological significance. Our results are significantly better than those of the counterpart OPSMRM (OPSM with repeated measurement) model, which adopts a set-valued matrix representation to capture data uncertainty. Second, we run experiments on an RFID trace dataset and show that our POPSM model is effective and efficient in capturing the common visiting subroutes among users.