Clustering is the process of grouping physical or abstract
objects into classes of similar objects. Clustering analysis helps
construct meaningful partitioning of a large set of objects based
on a "divide and conquer" methodology which decomposes
a large scale system into smaller components to simplify design
and implementation. Clustering divides a dataset into mutually
exclusive groups such that the members of each group are as "close"
as possible to one another, and different groups are as "far"
as possible from one another, where distance is measured with
respect to all available variables.
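The partitioning criterion above can be made concrete with a minimal sketch of Lloyd's k-means, one classical distance-based method, where "distance" is Euclidean distance measured over all available variables. The function and variable names here are illustrative, not from any particular system.

```python
# Minimal distance-based partitioning sketch: Lloyd's k-means.
# Members of each group end up close to their group's center, and
# different groups end up far apart.
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial centers: k random points
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the group of the nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[i].append(p)
        # Update step: each center moves to the mean of its group.
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, groups = kmeans(pts, k=2)
```

On this toy data the two tight point groups are recovered regardless of which points are sampled as initial centers.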
As a data mining task, data clustering identifies clusters, or
densely populated regions, according to some distance measurement,
in a large, multidimensional data set. Given a large set of multidimensional
data points, the data space is usually not uniformly occupied
by the data points. Data clustering identifies the sparse and the
crowded places, and hence discovers the overall distribution patterns
of the data set.
Data clustering has been studied in statistics, machine learning,
spatial database, and data mining areas with different emphases.
As a branch of statistics, clustering analysis has been studied
extensively for many years, mainly focused on distance-based clustering
analysis. Systems based on statistical classification methods,
such as AutoClass which uses a Bayesian classification method,
have been used in clustering in real world databases with reported
success.
The distance-based approaches assume that all the data points
are given in advance and can be scanned frequently. They are global
or semi-global methods at the granularity of data points. That
is, for each clustering decision, they inspect all data points
or all currently existing clusters equally no matter how close
or far away they are, and they use global measurements, which
require scanning all data points or all currently existing clusters.
Hence they cannot achieve linear scalability while maintaining
stable clustering quality.
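The "global" behavior described above can be sketched with a naive agglomerative step: every merge decision scans every pair of currently existing clusters, so each decision costs a full quadratic pass over the data, which is why such methods do not scale linearly. This is an illustrative sketch, not a specific published algorithm's implementation.

```python
# Naive single-link agglomerative clustering: each merge decision
# inspects ALL currently existing clusters, regardless of how close
# or far away they are -- the global scan the text describes.
import math

def closest_pair(clusters):
    """Return indices of the two closest clusters (single-link distance)."""
    best, best_pair = math.inf, None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):   # every existing cluster pair
            d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
            if d < best:
                best, best_pair = d, (i, j)
    return best_pair

def agglomerate(points, k):
    clusters = [[p] for p in points]            # start: one cluster per point
    while len(clusters) > k:
        i, j = closest_pair(clusters)           # global scan per decision
        clusters[i].extend(clusters.pop(j))
    return clusters

cl = agglomerate([(0, 0), (0, 1), (9, 9), (9, 8)], k=2)
```

Each call to `closest_pair` alone evaluates O(n^2) distances, and the loop makes n - k such calls; all points must remain available for repeated scanning, matching the assumptions criticized in the text.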
In machine learning, clustering analysis often refers to unsupervised
learning, since the classes to which objects belong are not prespecified,
or to conceptual clustering, because the distance measurement may
not be based on geometric distance but on whether a group
of objects represents a certain conceptual class. One needs to
define a measure of similarity between the objects and then apply
it to determine classes. Classes are defined as collections of
objects whose intraclass similarity is high and interclass similarity
is low.
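The stated criterion can be checked directly for a given partition: compare the average intraclass distance with the average interclass distance, where a good clustering makes the former small and the latter large. The helper below is a sketch with illustrative names; Euclidean distance stands in for whatever similarity measure one defines.

```python
# Evaluate a partition against the "high intraclass similarity,
# low interclass similarity" criterion, using Euclidean distance
# (small distance = high similarity).
import math
from itertools import combinations

def mean_intra_inter(classes):
    intra = [math.dist(p, q)
             for c in classes for p, q in combinations(c, 2)]
    inter = [math.dist(p, q)
             for a, b in combinations(classes, 2) for p in a for q in b]
    return sum(intra) / len(intra), sum(inter) / len(inter)

good = [[(0, 0), (0, 1)], [(9, 9), (9, 8)]]
intra, inter = mean_intra_inter(good)   # intra is small, inter is large
```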
The method of clustering analysis in conceptual clustering is
mainly based on probability analysis. Such approaches make the
assumption that the probability distributions on separate
attributes are statistically independent of one another. This assumption
is, however, not always true since correlation between attributes
often exists. Moreover, the probability distribution representation
of clusters makes it very expensive to update and store the clusters.
This is especially so when the attributes have a large number
of values since their time and space complexities depend not only
on the number of attributes, but also on the number of values
for each attribute. Furthermore, the probability-based tree that
is built to identify clusters is not height-balanced for skewed
input data, which may cause the time and space complexity to degrade
dramatically.
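The independence assumption and its storage cost can be sketched as follows: each cluster stores one probability distribution per attribute, and an object's fit to a cluster is the product of the per-attribute probabilities, which is valid only when attributes are independent. The per-cluster storage grows with (number of attributes) x (number of values per attribute), the cost the preceding paragraph notes. All names here are illustrative, not from any particular conceptual clustering system.

```python
# Probability-distribution representation of a cluster under the
# attribute-independence assumption.
from collections import Counter

def attribute_distributions(cluster):
    """Per-attribute value probabilities for a cluster of tuples."""
    n = len(cluster)
    return [{v: c / n for v, c in Counter(col).items()}
            for col in zip(*cluster)]

def fit(obj, dists):
    """Probability of obj under the cluster, assuming independence:
    the product of the per-attribute probabilities."""
    p = 1.0
    for value, dist in zip(obj, dists):
        p *= dist.get(value, 0.0)
    return p

cluster = [("red", "round"), ("red", "round"), ("green", "round")]
dists = attribute_distributions(cluster)
# P(red) = 2/3 and P(round) = 1, so fit of ("red", "round") is 2/3 --
# correct only if color and shape are truly independent in the cluster.
```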
Clustering analysis in large databases has been studied recently
in the database community.
References

[1] D. Fisher. Improving inference through conceptual clustering. In Proc. 1987 AAAI Conf., pages 461-465, Seattle, Washington, July 1987.
[2] D. Fisher. Optimization and simplification of hierarchical clusterings. In Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining (KDD'95), pages 118-123, Montreal, Canada, Aug. 1995.
[3] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pages 144-155, Santiago, Chile, Sept. 1994.
[4] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[5] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[6] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, Montreal, Canada, June 1996.
[7] H. V. Jagadish, L. V. S. Lakshmanan, and D. Srivastava. Snakes and sandwiches: Optimal clustering strategies for a data warehouse. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data, 1999.
[8] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, Seattle, Washington, June 1998.
[9] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data, 1999.
[10] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data, 1999.
[11] A. A. Freitas and S. H. Lavington. Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers, 1998.
[12] G. L. Miller, S.-H. Teng, W. Thurston, and S. A. Vavasis. Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM, 44(1):1-29, January 1997.
[13] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26:354-359, 1983.
[14] D. W. Matula. Graph theoretic techniques for cluster analysis algorithms. In Classification and Clustering. Academic Press, 1977.
[15] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, 1996.