A GA-Based Clustering Algorithm for Large Data Sets with Mixed Numerical and Categorical Values
Abstract: In data mining, cluster analysis must often be performed on large data sets with mixed numerical and categorical values. However, most existing clustering algorithms handle only numerical data or only categorical data, and cannot analyze data with mixed attributes. This paper therefore presents a novel GA-based fuzzy clustering algorithm for such mixed data sets. It modifies the common clustering cost function, the trace of the within-cluster dispersion matrix, so that numerical and categorical features are combined in a single objective. A Genetic Algorithm (GA) is then used to optimize the new cost function, which allows the global optimum to be reached quickly and without dependence on prototype initialization. Experimental results show that the new GA-based clustering algorithm is effective for clustering large data sets with mixed numerical and categorical values.

Keywords: Cluster analysis; Numerical features; Categorical features; Genetic algorithm
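
The abstract does not state the modified cost function or the GA operators, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes a k-prototypes-style dissimilarity (squared Euclidean distance on the numerical attributes plus a weight `gamma` times the number of categorical mismatches), hard rather than fuzzy assignment, and a simple GA with elitism, tournament selection, prototype-level uniform crossover, and random mutation. All function names and parameters (`mixed_distance`, `cost`, `mutate`, `ga_cluster`, `gamma`, `sigma`) are hypothetical.

```python
import random

def mixed_distance(x, proto, gamma=1.0):
    """Assumed dissimilarity: squared Euclidean distance on the numeric
    part plus gamma times the number of categorical mismatches."""
    (x_num, x_cat), (p_num, p_cat) = x, proto
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    mismatches = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric + gamma * mismatches

def cost(prototypes, data, gamma=1.0):
    """Total cost with each point assigned to its nearest prototype
    (hard assignment, a simplification of the fuzzy formulation)."""
    return sum(min(mixed_distance(x, p, gamma) for p in prototypes)
               for x in data)

def mutate(proto, data, rng, sigma=0.1):
    """Jitter numeric values with Gaussian noise; with probability 0.1,
    resample a categorical value from a random data point."""
    num, cat = proto
    num = tuple(v + rng.gauss(0.0, sigma) for v in num)
    cat = tuple(rng.choice(data)[1][j] if rng.random() < 0.1 else c
                for j, c in enumerate(cat))
    return (num, cat)

def ga_cluster(data, k, gamma=1.0, pop_size=20, generations=50, seed=0):
    """Search for k prototypes that minimize the mixed cost."""
    rng = random.Random(seed)

    def fitness(chromosome):
        return cost(chromosome, data, gamma)

    # Each chromosome is a list of k prototypes drawn from the data.
    population = [rng.sample(data, k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        next_gen = population[:2]  # elitism: keep the two best unchanged
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = min(rng.sample(population, 3), key=fitness)
            p2 = min(rng.sample(population, 3), key=fitness)
            # Uniform crossover at the prototype level, then mutation.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            next_gen.append([mutate(p, data, rng) for p in child])
        population = next_gen
    return min(population, key=fitness)

# Toy data: each point is (numeric attributes, categorical attributes).
data = [
    ((0.0, 0.0), ("a",)), ((0.1, 0.0), ("a",)), ((0.0, 0.1), ("a",)),
    ((5.0, 5.0), ("b",)), ((5.1, 5.0), ("b",)), ((5.0, 5.1), ("b",)),
]
best = ga_cluster(data, k=2)
```

Because the two best chromosomes are carried forward unchanged, the best cost never increases from one generation to the next. The weight `gamma` controls how strongly a categorical mismatch counts relative to numeric distance and would need tuning for real data.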