Classification

Data classification is the process of finding the common properties among a set of objects in a database and assigning those objects to different classes according to a classification model. Classification divides a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another and different groups are as "far" as possible from one another, where distance is measured with respect to the specific variable(s) being predicted. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad."

To construct a classification model, a sample database E is treated as the training set. Each tuple in E consists of the same set of attributes (or features) as the tuples in a large database W, and additionally has a known class identity (label) associated with it. The objective of classification is first to analyze the training data and develop an accurate description, or model, for each class using the features available in the data. These class descriptions are then used to classify future test data in the database W, or to develop a better description (called classification rules) for each class in the database. Applications of classification include medical diagnosis, performance prediction, and selective marketing, to name a few.
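
The train-then-classify workflow described above can be illustrated with a minimal sketch. It assumes a nearest-centroid model, chosen here only for brevity and not prescribed by the text: the "description" of each class is the mean of its feature vectors in E, and a tuple from W is assigned to the class with the closest mean. The feature names (debt ratio, years in business) and all data values are hypothetical.

```python
from collections import defaultdict
from math import dist


def train(E):
    """Build one description per class: the mean feature vector (centroid)."""
    sums = {}
    counts = defaultdict(int)
    for features, label in E:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(model, features):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training set E: (debt_ratio, years_in_business) -> creditworthiness
E = [
    ((0.2, 12.0), "Good"),
    ((0.3, 8.0), "Good"),
    ((0.8, 2.0), "Bad"),
    ((0.9, 1.0), "Bad"),
]
# Hypothetical unlabeled tuples from the larger database W
W = [(0.25, 10.0), (0.85, 3.0)]

model = train(E)
predictions = [classify(model, t) for t in W]
print(predictions)
```

Any of the algorithms listed below could take the place of `train` and `classify`; the division into a labeled set E and an unlabeled set W stays the same.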

Data classification has been studied substantially in statistics, machine learning, neural networks, and expert systems and is an important theme in data mining.

Algorithms

ID3
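
ID3 grows a decision tree by repeatedly splitting on the attribute with the highest information gain, that is, the largest reduction in class entropy. The sketch below shows only that attribute-selection step, not the full recursive tree construction. The attribute names and the four example rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical training tuples with two categorical attributes
rows = [
    {"sector": "tech", "profitable": "yes"},
    {"sector": "tech", "profitable": "no"},
    {"sector": "retail", "profitable": "yes"},
    {"sector": "retail", "profitable": "no"},
]
labels = ["Good", "Bad", "Good", "Bad"]

gains = {a: information_gain(rows, labels, a) for a in ("sector", "profitable")}
best = max(gains, key=gains.get)  # attribute a tree root would split on
print(best, gains)
```

In this toy data, splitting on `profitable` separates the classes perfectly while `sector` tells nothing about the label, so `profitable` is selected.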

C4.5

Bayesian network

Nearest neighbor
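
A nearest-neighbor classifier keeps the training tuples themselves as the model and labels a new tuple by majority vote among its k closest training tuples. The following is a minimal sketch assuming numeric features and Euclidean distance; the function name `knn_classify` and the data points are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(training, query, k=3):
    """Return the majority label among the k training tuples nearest to query."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled tuples: (feature vector, class label)
training = [
    ((1.0, 1.0), "Good"),
    ((1.2, 0.8), "Good"),
    ((0.9, 1.1), "Good"),
    ((4.0, 4.2), "Bad"),
    ((4.1, 3.9), "Bad"),
]

result_a = knn_classify(training, (1.1, 1.0))       # votes among 3 neighbors
result_b = knn_classify(training, (4.0, 4.0), k=1)  # single nearest neighbor
print(result_a, result_b)
```

Sorting the whole training set for every query is the simplest approach; it is adequate for small data but would need an index structure for a large database.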

References

[1]	S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
[2]	J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
[3]	P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153-180. AAAI/MIT Press, 1996.
[4]	J. Elder IV and D. Pregibon. A statistical perspective on knowledge discovery in databases. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 83-115. AAAI/MIT Press, 1996.
[5]	R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An Interval Classifier for Database Mining Applications. Proceedings of the 18th International Conference on Very Large Data Bases, pages 560-573, August 1992.
[6]	T.M. Anwar, H.W. Beck, and S.B. Navathe. Knowledge Mining by Imprecise Querying: A Classification-Based Approach. Proceedings of the 8th International Conference on Data Engineering, pages 622-630, February 1992.
[7]	M.-S. Chen and P. S. Yu. Using Multi-Attribute Predicates for Mining Classification Rules. IBM Research Report, 1995.
[8]	A. Srivastava, E.-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237-261, September 1999.
[9]	J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 22nd VLDB Conference, 1996.
[10]	S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey, Data Mining and Knowledge Discovery, Vol. 2, 1998.
[11]	J. R. Quinlan, "Boosting, Bagging, and C4.5", AAAI'96.
[12]	B. Liu, W. Hsu and Y. Ma, Integrating Classification and Association Rule Mining, KDD98, New York, Aug. 1998.
[13]	M. Mehta, R. Agrawal and J. Rissanen: "SLIQ: A Fast Scalable Classifier for Data Mining", Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, March 1996.