
March 26, 2018 | Author: Rishabh Agarwal | Category: Cluster Analysis, Data Mining, Data Analysis, Areas Of Computer Science, Technology






Chapter 10
Cluster Analysis: Basic Concepts and Methods

10.1 Bibliographic Notes

Clustering has been studied extensively for over 40 years and across many disciplines due to its broad applications. Most books on pattern classification and machine learning contain chapters on cluster analysis or unsupervised learning. Several textbooks are dedicated to the methods of cluster analysis, including Hartigan [Har75]; Jain and Dubes [JD88]; Kaufman and Rousseeuw [KR90]; and Arabie, Hubert, and De Soete [AHS96]. There are also many survey articles on different aspects of clustering methods. Recent ones include Jain, Murty, and Flynn [JMF99]; Parsons, Haque, and Liu [PHL04]; and Jain [Jai10].

For partitioning methods, the k-means algorithm was first introduced by Lloyd [Llo57], and then by MacQueen [Mac67]. Arthur and Vassilvitskii [AV07] presented the k-means++ algorithm. A filtering algorithm, which uses a spatial hierarchical data index to speed up the computation of cluster means, is given in Kanungo, Mount, Netanyahu, Piatko, Silverman, and Wu [KMN+02].

The k-medoids algorithms of PAM and CLARA were proposed by Kaufman and Rousseeuw [KR90]. The k-modes (for clustering nominal data) and k-prototypes (for clustering hybrid data) algorithms were proposed by Huang [Hua98]. The k-modes clustering algorithm was also proposed independently by Chaturvedi, Green, and Carroll [CGC94, CGC01]. The CLARANS algorithm was proposed by Ng and Han [NH94]. Ester, Kriegel, and Xu [EKX95] proposed techniques for further improving the performance of CLARANS using efficient spatial access methods, such as the R*-tree, and focusing techniques. A k-means-based scalable clustering algorithm was proposed by Bradley, Fayyad, and Reina [BFR98].

An early survey of agglomerative hierarchical clustering algorithms was conducted by Day and Edelsbrunner [?]. Agglomerative hierarchical clustering, such as AGNES, and divisive hierarchical clustering, such as DIANA, were introduced by Kaufman and Rousseeuw [KR90]. An interesting direction for improving the clustering quality of hierarchical clustering methods is to integrate hierarchical clustering with distance-based iterative relocation or other non-hierarchical clustering methods. For example, BIRCH, by Zhang, Ramakrishnan, and Livny [ZRL96], first performs hierarchical clustering with a CF-tree before applying other techniques. Hierarchical clustering can also be performed by sophisticated linkage analysis, transformation, or nearest-neighbor analysis, such as CURE by Guha, Rastogi, and Shim [GRS98]; ROCK (for clustering nominal attributes) by Guha, Rastogi, and Shim [GRS99]; and Chameleon by Karypis, Han, and Kumar [KHK99]. A probabilistic hierarchical clustering framework, following normal linkage algorithms and using probabilistic models to define cluster similarity, was developed by Friedman [Fri03] and by Heller and Ghahramani [HG05].

For density-based clustering methods, DBSCAN was proposed by Ester, Kriegel, Sander, and Xu [EKSX96]. Ankerst, Breunig, Kriegel, and Sander [ABKS99] developed OPTICS, a cluster ordering method that facilitates density-based clustering without worrying about parameter specification. The DENCLUE algorithm, based on a set of density distribution functions, was proposed by Hinneburg and Keim [HK98]. Hinneburg and Gabriel [HG07] developed DENCLUE 2.0, which includes a new hill-climbing procedure for Gaussian kernels that adjusts the step size automatically.

STING, a grid-based multiresolution approach that collects statistical information in grid cells, was proposed by Wang, Yang, and Muntz [WYM97]. WaveCluster, developed by Sheikholeslami, Chatterjee, and Zhang [SCZ98], is a multiresolution clustering approach that transforms the original feature space by wavelet transform.

Scalable methods for clustering nominal data were studied by Gibson, Kleinberg, and Raghavan [GKR98]; Guha, Rastogi, and Shim [GRS99]; and Ganti, Gehrke, and Ramakrishnan [GGR99]. There are also many other clustering paradigms. For example, fuzzy clustering methods are discussed in Kaufman and Rousseeuw [KR90], in Bezdek [Bez81], and in Bezdek and Pal [BP92].

For high-dimensional clustering, an Apriori-based dimension-growth subspace clustering algorithm called CLIQUE was proposed by Agrawal, Gehrke, Gunopulos, and Raghavan [AGGR98]. It integrates density-based and grid-based clustering methods.

Recent studies have proceeded to clustering stream data [BBD+02]. A k-median-based data stream clustering algorithm was proposed by Guha, Mishra, Motwani, and O'Callaghan [GMMO00], and by O'Callaghan et al. [OMM+02]. A method for clustering evolving data streams was proposed by Aggarwal, Han, Wang, and Yu [AHWY03]. A framework for projected clustering of high-dimensional data streams was proposed by Aggarwal, Han, Wang, and Yu [AHWY04].

Clustering evaluation is discussed in a few monographs and survey articles, such as [JD88, HBV01]. The extrinsic methods for clustering quality evaluation are extensively explored; some recent studies include [Mei03, Mei05, AGAV09]. The four essential criteria introduced in this chapter are formulated in [AGAV09], while some individual criteria were also mentioned earlier, for example, in [Mei03, RH07]. Bagga and Baldwin [BB98] introduced the BCubed metrics. The silhouette coefficient is described in [KR90].
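To make the partitioning methods above concrete, here is a minimal Python sketch (not from the chapter) of Lloyd's k-means combined with the k-means++ seeding of Arthur and Vassilvitskii, which picks each new initial center with probability proportional to its squared distance from the nearest center already chosen. The 2-D tuple representation, function names, and parameters are illustrative assumptions, not the authors' implementations.

```python
import random

def kmeans_pp_seed(points, k, rng):
    """k-means++ seeding: sample new centers with probability
    proportional to squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest current center
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        else:
            centers.append(points[-1])
    return centers

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment
    and recomputation of each center as its cluster's mean."""
    rng = random.Random(seed)
    centers = kmeans_pp_seed(points, k, rng)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2
                                  + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        # empty clusters keep their old center
        centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Usage: two well-separated blobs should yield one center per blob.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(pts, 2, seed=1)
```

The seeding step is the whole contribution of [AV07]: compared to uniform random initialization, it provably bounds the expected cost and in practice avoids placing two initial centers in the same dense region.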
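The density-based idea behind DBSCAN can likewise be sketched in a few lines: a point with at least `min_pts` neighbors within radius `eps` is a core point, clusters are grown transitively from core points, and everything unreachable is labeled noise. This is a naive O(n²) illustration under assumed 2-D tuples, not the spatial-index-backed algorithm of the original paper.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise (DBSCAN idea)."""
    n = len(points)

    def neighbors(i):
        # all points within eps of point i (including i itself)
        return [j for j in range(n)
                if (points[i][0] - points[j][0]) ** 2
                 + (points[i][1] - points[j][1]) ** 2 <= eps * eps]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise; may be claimed later as a border point
            continue
        labels[i] = cluster         # i is a core point: start a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:   # j is also core: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels

# Usage: two tight blobs and one isolated outlier.
pts = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
       (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (5.1, 5.1),
       (10.0, 10.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

Note how no cluster count is supplied: the number of clusters falls out of the density parameters, which is exactly the property OPTICS [ABKS99] later relaxes by producing a cluster ordering instead of fixing `eps` up front.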
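Finally, the silhouette coefficient mentioned under clustering evaluation admits a compact definition: for each point, s = (b − a) / max(a, b), where a is the mean distance to the point's own cluster and b is the mean distance to the nearest other cluster. The sketch below is a straightforward Euclidean rendering of that formula, assuming 2-D tuples; it is an illustration, not the formulation as given in [KR90].

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: s = (b - a) / max(a, b) per point,
    averaged over all points. Values near 1 indicate compact,
    well-separated clusters."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)   # common convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Usage: two distant pairs should score close to the maximum of 1.
pts = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (5.0, 5.1)]
score = silhouette(pts, [0, 0, 1, 1])
```

Because it needs only the points and a labeling, the silhouette is an intrinsic measure, in contrast to the extrinsic criteria of [AGAV09], BCubed [BB98], and V-measure [RH07], which all require ground-truth labels.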
Bibliography

[ABKS99] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 49–60, Philadelphia, PA, June 1999.

[AGAV09] E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12, 2009.

[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 94–105, Seattle, WA, June 1998.

[AHS96] P. Arabie, L. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.

[AHWY03] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proc. 2003 Int. Conf. Very Large Data Bases (VLDB'03), pages 81–92, Berlin, Germany, Sept. 2003.

[AHWY04] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In Proc. 2004 Int. Conf. Very Large Data Bases (VLDB'04), pages 852–863, Toronto, Canada, Aug. 2004.

[AV07] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proc. 2007 ACM-SIAM Symp. Discrete Algorithms (SODA'07), pages 1027–1035, 2007.

[BB98] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proc. 1998 Annual Meeting of the Association for Computational Linguistics and Int. Conf. Computational Linguistics (COLING-ACL'98), Montreal, Canada, Aug. 1998.

[BBD+02] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. 2002 ACM Symp. Principles of Database Systems (PODS'02), pages 1–16, Madison, WI, June 2002.

[Bez81] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.

[BFR98] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 9–15, New York, NY, Aug. 1998.

[BP92] J. C. Bezdek and S. K. Pal. Fuzzy Models for Pattern Recognition: Methods That Search for Structures in Data. IEEE Press, 1992.

[CGC94] A. Chaturvedi, P. Green, and J. Carroll. K-means, k-medians and k-modes: Special cases of partitioning multiway data. In The Classification Society of North America (CSNA) Meeting Presentation, Houston, TX, 1994.

[CGC01] A. Chaturvedi, P. Green, and J. Carroll. K-modes clustering. J. Classification, 18:35–55, 2001.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pages 226–231, Portland, OR, Aug. 1996.

[EKX95] M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. In Proc. 1995 Int. Symp. Large Spatial Databases (SSD'95), pages 67–82, Portland, ME, Aug. 1995.

[Fri03] N. Friedman. Pcluster: Probabilistic agglomerative clustering of gene expression profiles. Technical Report 2003-80, Hebrew Univ., 2003.

[GGR99] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS—clustering categorical data using summaries. In Proc. 1999 Int. Conf. Knowledge Discovery and Data Mining (KDD'99), pages 73–83, San Diego, CA, 1999.

[GKR98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), pages 311–323, New York, NY, Aug. 1998.

[GMMO00] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. 2000 Symp. Foundations of Computer Science (FOCS'00), pages 359–366, Redondo Beach, CA, 2000.

[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 73–84, Seattle, WA, June 1998.

[GRS99] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), pages 512–521, Sydney, Australia, Mar. 1999.

[Har75] J. Hartigan. Clustering Algorithms. John Wiley & Sons, 1975.

[HBV01] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. J. Intell. Inf. Syst., 17:107–145, 2001.

[HG05] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proc. 22nd Int. Conf. Machine Learning (ICML'05), pages 297–304, Bonn, Germany, 2005.

[HG07] A. Hinneburg and H.-H. Gabriel. DENCLUE 2.0: Fast clustering based on kernel density estimation. In Proc. 2007 Int. Conf. Intelligent Data Analysis (IDA'07), pages 70–80, Ljubljana, Slovenia, 2007.

[HK98] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 58–65, New York, NY, Aug. 1998.

[Hua98] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2:283–304, 1998.

[Jai10] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Lett., 2010.

[JD88] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[JMF99] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A survey. ACM Comput. Surv., 31:264–323, 1999.

[KHK99] G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32:68–75, 1999.

[KMN+02] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 24:881–892, 2002.

[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.

[Llo57] S. Lloyd. Least squares quantization in PCM. IEEE Trans. Information Theory, 28:128–137, 1982 (original version: Technical Report, Bell Labs, 1957).

[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1:281–297, 1967.

[Mei03] M. Meilă. Comparing clusterings by the variation of information. In Proc. 16th Annual Conf. Computational Learning Theory (COLT'03), pages 173–187, Washington, DC, Aug. 2003.

[Mei05] M. Meilă. Comparing clusterings: An axiomatic view. In Proc. 22nd Int. Conf. Machine Learning (ICML'05), pages 577–584, Bonn, Germany, 2005.

[NH94] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pages 144–155, Santiago, Chile, Sept. 1994.

[OMM+02] L. O'Callaghan, A. Meyerson, R. Motwani, N. Mishra, and S. Guha. Streaming-data algorithms for high-quality clustering. In Proc. 2002 Int. Conf. Data Engineering (ICDE'02), pages 685–696, San Francisco, CA, Apr. 2002.

[PHL04] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6:90–105, 2004.

[RH07] A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Proc. 2007 Joint Conf. Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07), pages 410–420, Prague, Czech Republic, 2007.

[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), pages 428–439, New York, NY, Aug. 1998.

[WYM97] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB'97), pages 186–195, Athens, Greece, Aug. 1997.

[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 103–114, Montreal, Canada, June 1996.