一種面向多屬性不確定數(shù)據(jù)流的模體發(fā)現(xiàn)算法
doi: 10.11999/JEIT160247 cstr: 32379.14.JEIT160247
基金項目:
國家自然科學基金(61272011)
Motif Discovery Algorithm for Multiple Attributes Uncertain Data Stream
Funds:
The National Natural Science Foundation of China (61272011)
-
摘要: 該文針對多屬性不確定數(shù)據(jù)流的頻繁模式發(fā)現(xiàn)問題,借鑒生物信息學中的模體發(fā)現(xiàn)思想,提出了一種基于MEME(Multiple Expectation-maximization for Motif Elicitation)的多屬性不確定數(shù)據(jù)流模體發(fā)現(xiàn)算法。該算法根據(jù)不確定數(shù)據(jù)流的特點,設(shè)計了基于混合型模型的不確定滑動窗口更新計算方法,改進了SAX(Symbolic Aggregate approXimation)的符號化策略,提出了不同滑動窗口下多屬性模體的相似性分析方法。在實驗當中,用防空反導情報傳感器網(wǎng)絡(luò)中的一組不確定數(shù)據(jù)流驗證了其功能,通過植入不同數(shù)目的模體測試了其發(fā)現(xiàn)準確率,并在元組有效概率設(shè)置為1的條件下與已有算法進行了比較,結(jié)果表明:該算法可以較準確地發(fā)現(xiàn)多屬性不確定數(shù)據(jù)流中的頻繁模式。Abstract: Algorithm of motif discovery for multiple attributes uncertain data stream is proposed on the basis of MEME (Multiple Expectation-maximization for Motif Elicitation), which consults the thought of sequential pattern discovery in bioinformatics to solve the problem of frequent pattern discovery for multiple attributes uncertain data stream. A new method for update calculation of uncertain sliding window is designed based on mixed type model, SAX (Symbolic Aggregate approXimation) symbolic strategy is improved, and similarity analysis method for multiple attributes motifs under different sliding windows is put forward. The proposed algorithm is verified to be correct functionally by a set of uncertain data stream in the wireless sensor network of air and missile defense. Its accuracy is measured through planting different number of motifs. Furthermore, comparison with previous algorithm with tuples valid probability set to 1 shows that the proposed algorithm can discover frequent pattern for multiple attributes uncertain data stream precisely.
-
LEUNG C K S, JIANG F, and HAYDUK Y. A landmark- model based system for mining frequent patterns from uncertain data streams[C]. 2011 International Database Engineering and Applications Symposium, Lisbon, Portugal, 2011: 249-250. doi: 10.1145/2076623.2076659. CHUI C K and KAO B. A decremental approach for mining frequent itemsets from uncertain data[C]. 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Osaka, Japan, 2008: 64-75. doi: 10.1007/978-3-540-68125. LEUNG C K S, HAO B, and BRAJCZUK D A. Mining uncertain data for frequent itemsets that satisfy aggregate constraints[C]. 25th Annual ACM Symposium on Applied Computing, Sierre, Switzerland, 2010: 1034-1038. doi: 10.1145/1774088.1774305. LEUNG C K S and HAO B. Mining of frequent items from streams of uncertain data[C]. 25th IEEE International Conference on Data Engineering, Piscataway, NJ, USA, 2009: 1663-1670. doi: 10.1109/ICDE.2009.157. 湯克明. 不確定數(shù)據(jù)流中頻繁數(shù)據(jù)挖掘[D]. [博士論文], 南京航空航天大學, 2012. TANG Keming. Study on frequent data mining from uncertain data streams[D]. [Ph.D. dissertation], Nanjing University of Aeronautics and Astronautics, 2012. HEWANADUNGODAGE C, YUNI X, and LEE J J. Hyper-structure mining of frequent patterns in uncertain data streams[J]. Knowledge and Information Systems, 2013, 37: 219-244. doi: 10.1007/s10115-012-0581-y. LEUNG C K S, CUZZOCREA A, FAN J, et al. Discovering frequent patterns from uncertain data streams with time-fading and landmark models[J]. Transactions on Large-Scale Data and Knowledge-Centered Systems VIII, 2013: 174-196. doi: 10.1007/978-3-642-37574-3_8. 朱躍龍, 彭力, 李士進, 等. 水文時間序列模體挖掘[J]. 水利學報, 2012, 43(12): 1422-1430. ZHU Yuelong, PENG Li, LI Shijin, et al. Research on hydrological time series mining [J]. Journal of Hydraulic Engineering, 2012, 43(12): 1422-1430. 張懿璞. 轉(zhuǎn)錄因子結(jié)合位點識別問題的算法研究[D]. [博士論文], 西安電子科技大學, 2014. ZHANG Yipu. Algorithm research on the problem of transcription factor binging sites identification[D]. [Ph.D. dissertation], Xidian University, 2014. 楊矯云. 大規(guī)模生物序列分析的高性能算法和模型[D]. [博士論文], 中國科學技術(shù)大學, 2014. YANG Jiaoyun. High performance algorithms and models for large-scale biological sequence analysis[D]. [Ph.D. dissertation], University of Science and Technology of China, 2014. LIN J, KEOGH E, PATEL P, et al. Finding motifs in time series[C]. Proceedings of the 2nd Workshop on Temporal Data Mining at KDD, District of Colombia, USA, 2002: 53-68. CHIU B, KEOGH E, and LONARDI S. Probabilistic discovery of time series motifs[C]. 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, District of Colombia, USA, 2003: 493-498. doi: 10.1145/956750.956808. FERREIRA P G, AZEVEDO P J, SILVA C G, et al. Mining approximate motifs in time series[C]. 9th international conference on Discovery Science, Berlin, Germany, 2006: 89-101 . MUEEN A, KEOGH E, ZHU Q, et al. Exact discovery of time series motif[C]. 9th SIAM International Conference on Data Mining 2009, Nevada, USA, 2009: 469-480. ABDULLAH M and NIKAN C. Enumeration of time series motifs of all lengths[J]. Knowledge and Information Systems, 2015, 45: 105-132. doi: 10.1007/s10115-014-0793-4. 張懿璞, 霍紅衛(wèi), 于強, 等. 用于轉(zhuǎn)錄因子結(jié)合位點識別的定位投影求精算法[J]. 計算機學報, 2013, 36(12): 2545-2559. doi: 10.3724/SP.J.1016.2013.02545. ZHANG Yipu, HUO Hongwei, YU Q, et al. A novel fixed- position projection refinement algorithm for TFBS Identification[J]. Chinese Journal of Computers, 2013, 36(12): 2545-2559. doi: 10.3724/SP.J.1016.2013.02545. TIMOTHY L B. DREME: motif discovery in transcription factor ChIP-seq data[J]. Original Paper, 2011, 17(12): 1653-1659. doi: 10.1093/bioinformatics/btr261. DANIEL Q and XIE Xiaohui. EXTREME: an online EM algorithm for motif discovery[J]. Original Paper, 2014, 30(12): 1667-1673. doi: 10.1093/bioinformatics/btu093. THANH T L T, PENG Liping, DIAO Yanlei, et al. CLARO: modeling and processing uncertain data streams[J]. The VLDB Journal, 2012, 21: 651676. doi: 10.1007/s00778- 011-0261-7. ARCHAMBEAU C and VERLEYSEN M. Manifold constrained finite Gaussian mixtures [C]. 8th International Work Conference on Artificial Neural Networks, Berlin, Germany, 2005: 820828. MICHELE D. Modeling and querying data series and data streams with uncertainty[D]. [Ph.D. dissertation], University of Trento, 2014. HONG Y. On computing the distribution function for the sum of independent and non-identical random indicators [R]. Technical Report, Department of Statistics, Virginia Tech, 2011. 曲文龍, 張克君, 楊炳儒, 等. 基于奇異事件特征聚類的時間序列符號化方法[J]. 系統(tǒng)工程與電子技術(shù), 2006, 28(8): 11311134. QU Wenlong, ZHANG Kejun, YANG Bingru, et al. Time series symbolization based on singular event feature clustering[J]. Systems Engineering and Electronics, 2006, 28(8): 11311134. JESSICA L, EAMONN K, LI W, et al. Experiencing SAX: a novel symbolic representation of time series[J]. Data Minning and Knowledge Discovery, 2007, 15: 107144. doi: 10.1007/ s10618-007- 0064-z. -
計量
- 文章訪問數(shù): 1362
- HTML全文瀏覽量: 136
- PDF下載量: 498
- 被引次數(shù): 0