A Hierarchical Clustering Method Based on the Threshold of Semantic Feature in Big Data
doi: 10.11999/JEIT150422 cstr: 32379.14.JEIT150422
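①(School of Information Science and Engineering, Central South University, Changsha 410083) ②(School of Electronics and Information, Hunan University of Science and Engineering, Yongzhou 425006)
Funds:
The National Natural Science Foundation of China (60173037, 6272496, 61272151), the Scientific Research Project of the Hunan Provincial Education Department (2015C0589), the Key Discipline Project of Hunan University of Science and Engineering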
A Hierarchical Clustering Method Based on the Threshold of Semantic Feature in Big Data
(School of Information Science and Engineering, Central South University, Changsha 410083, China)
Funds:
The National Natural Science Foundation of China (60173037, 6272496, 61272151)
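Abstract: Emerging services such as cloud computing, health care, street-view map services, and recommendation systems are driving the variety and scale of data to grow at an unprecedented rate, and the surge in data volume gives rise to many common problems, for example the representability, processability, and reliability of data. How to effectively process and analyze the relationships among data, improve the efficiency of data partitioning, and establish a clustering analysis model has become a problem that academia and industry both urgently need to solve. This paper proposes a hierarchical clustering method based on semantic features: training is first performed according to the semantic features of the data, hierarchical clustering is then carried out on each subset using the training result, and the density center points of the whole data set are finally produced, which improves the efficiency and accuracy of data clustering. The method has low sampling complexity, analyzes data accurately, is easy to implement, and has good discriminability.
Key words:
- Big data /
- Data extraction /
- Hierarchical clustering /
- Clustering analysis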
Abstract: The variety and scale of data are growing at an unprecedented speed, driven by emerging services including cloud computing, health care, street view services, recommendation systems, and so on. However, the surge in the volume of data may lead to many common problems, such as the representability, reliability, and processability of data. Therefore, how to effectively process and analyze the relationships among the data, improve the efficiency of data partitioning, and establish a data clustering analysis model has become a problem that both academia and business urgently need to solve. A hierarchical clustering method based on semantic features is proposed. Firstly, training is performed according to the semantic features of the data; then the training result is used to carry out hierarchical clustering in each subset; finally, the density center points of the whole data set are produced. This method can improve the efficiency and accuracy of data clustering. The algorithm has low sampling complexity, high accuracy of data analysis, and good discriminability. Furthermore, the algorithm is easy to implement.
Key words:
- Big data /
- Data extraction /
- Hierarchical clustering /
- Clustering analysis
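The abstract outlines three stages: training on semantic features, hierarchical clustering within each subset, and production of center points for the whole data set. The Python sketch below illustrates only the general shape of such a pipeline and is not the authors' algorithm. It assumes the semantic features are already available as numeric vectors and that a distance threshold has been fixed in advance (both stand in for the training stage), it uses plain centroid-linkage agglomeration, and it approximates a "density center" by a size-weighted centroid. The names `agglomerate`, `cluster_in_subsets`, `subset_size`, and `threshold` are hypothetical.

```python
from math import dist


def agglomerate(points, weights, threshold):
    """Merge the two closest centroids until no pair is within `threshold`.

    Returns a list of (centroid, weight) pairs, where weight is the number
    of original points absorbed into that centroid.
    """
    clusters = [(tuple(p), w) for p, w in zip(points, weights)]
    while len(clusters) > 1:
        # Find the closest pair of current centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # every remaining pair is farther apart than the threshold
        (ci, wi), (cj, wj) = clusters[i], clusters[j]
        # The size-weighted mean keeps the merged centroid equal to the
        # mean of all original points in the merged cluster.
        merged = tuple((a * wi + b * wj) / (wi + wj) for a, b in zip(ci, cj))
        clusters.pop(j)  # j > i, so remove j first to keep index i valid
        clusters.pop(i)
        clusters.append((merged, wi + wj))
    return clusters


def cluster_in_subsets(features, subset_size, threshold):
    """Cluster each subset separately, then cluster the subset centers."""
    local = []
    for start in range(0, len(features), subset_size):
        subset = features[start:start + subset_size]
        local.extend(agglomerate(subset, [1] * len(subset), threshold))
    # Second pass: merge the per-subset centers into global centers.
    centers = agglomerate([c for c, _ in local], [w for _, w in local], threshold)
    return sorted(centers)


if __name__ == "__main__":
    # Assumed toy feature vectors forming two well-separated groups.
    features = [(0, 0), (0, 1), (10, 10), (10, 11), (0, 0.5), (10, 10.5)]
    for center, size in cluster_in_subsets(features, subset_size=3, threshold=2.0):
        print(center, size)
```

Because each subset is clustered independently, the pairwise distance search runs on small groups rather than on the full data set; only the much smaller set of subset centers is compared globally. On the toy input this yields two centers, one per group.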