Clustering is the process of grouping physical or abstract
objects into classes of similar objects. Clustering analysis helps
construct meaningful partitioning of a large set of objects based
on a "divide and conquer" methodology which decomposes
a large scale system into smaller components to simplify design
and implementation. Clustering divides a dataset into mutually
exclusive groups such that the members of each group are as "close"
as possible to one another, and different groups are as "far"
as possible from one another, where distance is measured with
respect to all available variables.
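The partitioning criterion above can be made concrete with a minimal sketch of Lloyd's k-means, one classical distance-based method, where "distance" is Euclidean distance measured over all available variables. The function and variable names here are illustrative, not from any particular system.

```python
# Minimal distance-based partitioning sketch: Lloyd's k-means.
# Members of each group end up close to their group's center, and
# different groups end up far apart.
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial centers: k random points
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the group of the nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[i].append(p)
        # Update step: each center moves to the mean of its group.
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, groups = kmeans(pts, k=2)
```

On this toy data the two tight point groups are recovered regardless of which points are sampled as initial centers.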
As a data mining task, data clustering identifies clusters, or
densely populated regions, according to some distance measurement,
in a large, multidimensional data set. Given a large set of multidimensional
data points, the data space is usually not uniformly occupied
by the data points. Data clustering identifies the sparse and the
crowded places, and hence discovers the overall distribution patterns
of the data set.
Data clustering has been studied in statistics, machine learning,
spatial database, and data mining areas with different emphases.
As a branch of statistics, clustering analysis has been studied
extensively for many years, mainly focused on distance-based clustering
analysis. Systems based on statistical classification methods,
such as AutoClass which uses a Bayesian classification method,
have been used in clustering in real world databases with reported
success.
The distance-based approaches assume that all the data points
are given in advance and can be scanned frequently. They are global
or semi-global methods at the granularity of data points. That
is, for each clustering decision, they inspect all data points
or all currently existing clusters equally no matter how close
or far away they are, and they use global measurements, which
require scanning all data points or all currently existing clusters.
Hence they cannot achieve linear scalability while maintaining
stable clustering quality.
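The "global" behavior described above can be sketched with a naive agglomerative step: every merge decision scans every pair of currently existing clusters, so each decision costs a full quadratic pass over the data, which is why such methods do not scale linearly. This is an illustrative sketch, not a specific published algorithm's implementation.

```python
# Naive single-link agglomerative clustering: each merge decision
# inspects ALL currently existing clusters, regardless of how close
# or far away they are -- the global scan the text describes.
import math

def closest_pair(clusters):
    """Return indices of the two closest clusters (single-link distance)."""
    best, best_pair = math.inf, None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):   # every existing cluster pair
            d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
            if d < best:
                best, best_pair = d, (i, j)
    return best_pair

def agglomerate(points, k):
    clusters = [[p] for p in points]            # start: one cluster per point
    while len(clusters) > k:
        i, j = closest_pair(clusters)           # global scan per decision
        clusters[i].extend(clusters.pop(j))
    return clusters

cl = agglomerate([(0, 0), (0, 1), (9, 9), (9, 8)], k=2)
```

Each call to `closest_pair` alone evaluates O(n^2) distances, and the loop makes n - k such calls; all points must remain available for repeated scanning, matching the assumptions criticized in the text.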
In machine learning, clustering analysis often refers to unsupervised
learning, since the classes to which objects belong are not prespecified,
or to conceptual clustering, because the distance measurement may
not be based on geometric distance but on whether a group
of objects represents a certain conceptual class. One needs to
define a measure of similarity between the objects and then apply
it to determine classes. Classes are defined as collections of
objects whose intraclass similarity is high and interclass similarity
is low.
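The stated criterion can be checked directly for a given partition: compare the average intraclass distance with the average interclass distance, where a good clustering makes the former small and the latter large. The helper below is a sketch with illustrative names; Euclidean distance stands in for whatever similarity measure one defines.

```python
# Evaluate a partition against the "high intraclass similarity,
# low interclass similarity" criterion, using Euclidean distance
# (small distance = high similarity).
import math
from itertools import combinations

def mean_intra_inter(classes):
    intra = [math.dist(p, q)
             for c in classes for p, q in combinations(c, 2)]
    inter = [math.dist(p, q)
             for a, b in combinations(classes, 2) for p in a for q in b]
    return sum(intra) / len(intra), sum(inter) / len(inter)

good = [[(0, 0), (0, 1)], [(9, 9), (9, 8)]]
intra, inter = mean_intra_inter(good)   # intra is small, inter is large
```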
The method of clustering analysis in conceptual clustering is
mainly based on probability analysis. Such approaches make the
assumption that the probability distributions on separate
attributes are statistically independent of one another. This assumption
is, however, not always true since correlation between attributes
often exists. Moreover, the probability distribution representation
of clusters makes it very expensive to update and store the clusters.
This is especially so when the attributes have a large number
of values since their time and space complexities depend not only
on the number of attributes, but also on the number of values
for each attribute. Furthermore, the probability-based tree that
is built to identify clusters is not height-balanced for skewed
input data, which may cause the time and space complexity to degrade
dramatically.
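The independence assumption and its storage cost can be sketched as follows: each cluster stores one probability distribution per attribute, and an object's fit to a cluster is the product of the per-attribute probabilities, which is valid only when attributes are independent. The per-cluster storage grows with (number of attributes) x (number of values per attribute), the cost the preceding paragraph notes. All names here are illustrative, not from any particular conceptual clustering system.

```python
# Probability-distribution representation of a cluster under the
# attribute-independence assumption.
from collections import Counter

def attribute_distributions(cluster):
    """Per-attribute value probabilities for a cluster of tuples."""
    n = len(cluster)
    return [{v: c / n for v, c in Counter(col).items()}
            for col in zip(*cluster)]

def fit(obj, dists):
    """Probability of obj under the cluster, assuming independence:
    the product of the per-attribute probabilities."""
    p = 1.0
    for value, dist in zip(obj, dists):
        p *= dist.get(value, 0.0)
    return p

cluster = [("red", "round"), ("red", "round"), ("green", "round")]
dists = attribute_distributions(cluster)
# P(red) = 2/3 and P(round) = 1, so fit of ("red", "round") is 2/3 --
# correct only if color and shape are truly independent in the cluster.
```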
Clustering analysis in large databases has been studied recently
in the database community.
References

[1] D. Fisher. Improving inference through conceptual clustering. In Proc. 1987 AAAI Conf., pages 461-465, Seattle, Washington, July 1987.
[2] D. Fisher. Optimization and simplification of hierarchical clusterings. In Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining (KDD'95), pages 118-123, Montreal, Canada, Aug. 1995.
[3] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pages 144-155, Santiago, Chile, Sept. 1994.
[4] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[5] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[6] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, Montreal, Canada, June 1996.
[7] H. V. Jagadish, L. V. S. Lakshmanan, and D. Srivastava. Snakes and sandwiches: Optimal clustering strategies for a data warehouse. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data, 1999.
[8] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, Seattle, Washington, June 1998.
[9] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data, 1999.
[10] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data, 1999.
[11] A. A. Freitas and S. H. Lavington. Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers, 1998.
[12] G. L. Miller, S.-H. Teng, W. Thurston, and S. A. Vavasis. Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM, 44(1):1-29, January 1997.
[13] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26:354-359, 1983.
[14] D. W. Matula. Graph theoretic techniques for cluster analysis algorithms. In Classification and Clustering. Academic Press, 1977.
[15] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, 1996.