滿足本地差分隱私的混合噪音感知的模糊C均值聚類算法
doi: 10.11999/JEIT241067 cstr: 32379.14.JEIT241067
-
1.
安徽理工大學(xué)計算機科學(xué)與工程學(xué)院 淮南 232001
-
2.
云南省服務(wù)計算重點實驗室(云南財經(jīng)大學(xué)) 昆明 650221
-
3.
浙江大學(xué)計算機科學(xué)與技術(shù)學(xué)院 杭州 310058
-
4.
華北電力大學(xué)控制與計算機工程學(xué)院 北京 102206
-
5.
安徽理工大學(xué)安全科學(xué)與工程學(xué)院 淮南 232001
Fuzzy C-Means Clustering Algorithm Based on Mixed Noise-aware under Local Differential Privacy
-
1.
School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
-
2.
Key Laboratory of Service Computing, Yunnan University of Finance and Economics, Kunming 650221, China
-
3.
College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
-
4.
School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
-
5.
School of Safety Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
-
摘要: 在大數(shù)據(jù)和物聯(lián)網(wǎng)應(yīng)用中,本地差分隱私(LDP)技術(shù)用于保護聚類分析中的用戶隱私,但現(xiàn)有方法要么在LDP下交互式地進行聚類,需要消耗大量隱私預(yù)算,要么沒有同時考慮到聚類數(shù)據(jù)中蘊含的表示數(shù)據(jù)質(zhì)量的高斯噪音以及為滿足LDP保護的拉普拉斯噪音,致使聚類精度低下。同時,對于衡量用戶提交數(shù)據(jù)和簇心之間的距離選擇較為武斷,沒有充分利用到用戶提交的噪音數(shù)據(jù)中蘊含的噪音模式。為此,該文創(chuàng)新性地提出一種滿足LDP的混合噪音感知的模糊C均值聚類算法(mnFCM),該算法的主要思想是同時建模用戶上傳數(shù)據(jù)中蘊含的表示用戶質(zhì)量的高斯噪音以及為保護用戶數(shù)據(jù)注入的拉普拉斯噪音,進而設(shè)計出混合噪音感知的距離替代傳統(tǒng)的歐式距離,來衡量樣本數(shù)據(jù)與簇心間的相似性。特別地,在mnFCM中,該文首先設(shè)計了混合噪音感知的距離計算方法,在此基礎(chǔ)上給出算法新的目標函數(shù),并基于拉格朗日乘子法設(shè)計了求解方法,最后理論上分析了求解算法的收斂性。該文進一步理論分析了mnFCM的隱私、效用和復(fù)雜度,分析結(jié)果表明所提算法嚴格滿足LDP、相對于對比算法更接近非隱私下的簇心以及和非隱私算法具有接近的復(fù)雜度。在兩個真實數(shù)據(jù)集上的實驗結(jié)果表明,mnFCM在滿足LDP下,聚類精度提高了10%~15%。Abstract:
Objective In big data and Internet of Things (IoT) applications, clustering analysis of collected data is crucial for enhancing user experience. To mitigate privacy risks from using raw data directly, Local Differential Privacy (LDP) techniques are often employed. However, existing LDP clustering studies either require interactive execution, consuming significant privacy budgets, or fail to balance Gaussian noise in clustering data with Laplacian noise for LDP protection, resulting in low clustering accuracy. Moreover, distance metrics for similarity measurement are chosen arbitrarily without fully utilizing the noise characteristics of user-submitted noisy data. This study designs a hybrid noise-aware distance calculation method integrated into the fuzzy C-means clustering algorithm, effectively reducing noise impact on clustering results while protecting data privacy, ensuring both privacy security and clustering quality. It provides a robust solution for sensitive information processing in high-dimensional data environments. Methods This paper innovatively proposes a mixed noise-aware Fuzzy C-Means clustering algorithm (mnFCM) under LDP. The core idea is to model both Gaussian noise (representing data quality) and Laplacian noise (for data protection) in uploaded user data by constructing a more accurate mixed distribution model, and design a mixed noise-aware distance to replace Euclidean distance for measuring similarity between samples and cluster centers. Specifically, in mnFCM, this paper first designs a mixed noise-aware distance calculation method. On this basis, a new objective function for the algorithm is proposed, and a solution method is designed based on the Lagrange multiplier method. Finally, the convergence of the solution algorithm is theoretically analyzed. Results and Discussions The experimental results show that as the privacy budget ε increases, the performance of various clustering algorithms generally improves. Notably, mnFCM achieves at least a 8.5% improvement in accuracy compared to the state-of-the-art PrivPro algorithm ( Fig.1 ). This is because mnFCM innovatively considers both Gaussian noise (reflecting data quality) and Laplacian noise (for LDP protection), designing a hybrid noise-aware distance metric to enhance sample similarity measurement, thereby effectively protecting privacy while balancing clustering performance. Experiments on the fuzziness parameter m reveal that when m=2, all algorithms reach peak F-Measure values and lowest Entropy values (Fig.2 ), strongly validating m=2 as the optimal balance point for clustering effectiveness. Additionally, running time of mnFCM is 1.0 to 1.4 times that of the non-privacy-preserving Nopriv algorithm (Table 2 ), due to its refined noise processing mechanism. Ablation experiments demonstrate that the MixDis scheme achieves the best clustering performance on both NG and UW datasets (Fig.4 ), as it considers both Laplacian and Gaussian noise, making the clustered data more robust. Comparative analysis on the synthetic dataset Syn with other privacy-preserving clustering algorithms shows that DP-DPCL+ consistently outperforms DP-DPCL, and DPC+ consistently outperforms DPC (Fig.5 ). In addition, by varying the values of the four adjustable parameters—privacy budget ε, sample size N, dimension K, and cluster number C—it is evident that the mnFCM method outperforms other privacy protection schemes (Fig. 6 ).Conclusions This paper addresses the privacy protection issue in fuzzy clustering algorithms by simultaneously considering Gaussian noise (reflecting data quality) and Laplacian noise (for LDP protection), and innovatively proposes a mixed noise-aware fuzzy C-means clustering algorithm, mnFCM, satisfying LDP to balance privacy security and clustering quality. It designs a mixed noise-aware distance calculation method, formulates a new objective function, and solves it using the Lagrange multiplier method, while theoretically analyzing the algorithm’s convergence. Theoretical analysis shows that the algorithm strictly satisfies LDP, is closer to non-private cluster centroids compared to baseline algorithms, and has similar complexity to non-private algorithms. Experiments demonstrate that the algorithm improves clustering accuracy by 10%~15% on real datasets compared to baseline privacy-preserving algorithms. However, a limitation of this study is that the privacy budget calculation for Laplacian noise in the mixed noise setting may be influenced by Gaussian noise. In future research, the adaptive noise proportion allocation strategies, such as dynamically adjusting the weights of Gaussian/Laplacian noise, will be further explored to optimize the privacy-utility trade-off. -
表 1 常用符號列表
符號 符號含義 ε 隱私預(yù)算 C, N, K 簇、樣本數(shù)據(jù)以及樣本屬性的個數(shù) uci 樣本數(shù)據(jù)的劃分隸屬度 Dci 樣本數(shù)據(jù)與簇心間的混合噪音感知的距離 $ {x_i} $, pc 樣本數(shù)據(jù)i、簇c的質(zhì)心 $ {{\hat {\boldsymbol x}}_i} $ 加噪后的樣本數(shù)據(jù)i m, γ 模糊度參數(shù)、收斂終止參數(shù) $\tau $ 迭代閾值 下載: 導(dǎo)出CSV
1 mnFCM算法
輸入:樣本數(shù)據(jù)集X={x1,x2, ···, xN},模糊度參數(shù)m,收斂終止
參數(shù)γ,最大迭代次數(shù)τmax;輸出:樣本劃分隸屬度uci,簇心pc /*本地端*/ for i=1, 2, ··· , N do 調(diào)用Laplace機制對樣本數(shù)據(jù)進行加噪; 用戶將加噪后的樣本數(shù)據(jù)$ {\hat {\boldsymbol x}}_i$上傳給服務(wù)器; end for /*服務(wù)器端*/ 設(shè)置初始迭代次數(shù):τ =0; 隨機初始化uci使其滿足0≤uci≤1; 迭代開始: 根據(jù)式(22)更新pc; 根據(jù)式(12)更新Dci; 根據(jù)式(18)更新uci; 更新迭代次數(shù)τ= τ+1; 迭代終止條件 max|uci(τ)–uci(τ–1)|≤γ或者τ =τmax。 return uci, pc; 下載: 導(dǎo)出CSV
表 2 算法運行時間對比
對比算法 NG (min) UW(s) NoPriv 5 8.33 PrivDis 19 23.67 PrivPro 26 44.36 PrivKM 24 34.63 mnFCM 7 8.42 下載: 導(dǎo)出CSV
-
[1] MISTRY H K, MAVANI C, GOSWAMI A, et al. The impact of cloud computing and AI on industry dynamics and competition[J]. Educational Administration: Theory and Practice, 2024, 30(7): 797–804. doi: 10.53555/kuey.v30i7.6851. [2] 常璐瑤, 牛新征, 羅濤, 等. 基于子博弈完美均衡的啟發(fā)式聚類算法[J]. 電子學(xué)報, 2024, 52(3): 740–750. doi: 10.12263/DZXB.20221206.CHANG Luyao, NIU Xinzheng, LUO Tao, et al. Heuristic clustering algorithm based on sub-game perfect equilibrium[J]. Acta Electronica Sinica, 2024, 52(3): 740–750. doi: 10.12263/DZXB.20221206. [3] 黃鶴, 李文龍, 楊瀾, 等. 跳躍跟蹤SSA交叉迭代AP聚類算法[J]. 電子學(xué)報, 2024, 52(3): 977–990. doi: 10.12263/DZXB.20220209.HUANG He, LI Wenlong, YANG Lan, et al. Jump tracking SSA hybrid iterative AP clustering algorithm[J]. Acta Electronica Sinica, 2024, 52(3): 977–990. doi: 10.12263/DZXB.20220209. [4] 張強, 葉阿勇, 葉幗華, 等. 最優(yōu)聚類的k-匿名數(shù)據(jù)隱私保護機制[J]. 計算機研究與發(fā)展, 2022, 59(7): 1625–1635. doi: 10.7544/issn1000-1239.20210117.ZHANG Qiang, YE Ayong, YE Guohua, et al. k-Anonymous data privacy protection mechanism based on optimal clustering[J]. Journal of Computer Research and Development, 2022, 59(7): 1625–1635. doi: 10.7544/issn1000-1239.20210117. [5] LIN Wanyu, LI Baochun, and WANG Cong. Towards private learning on decentralized graphs with local differential privacy[J]. IEEE Transactions on Information Forensics and Security, 2022, 17: 2936–2946. doi: 10.1109/TIFS.2022.3198283. [6] 傅培旺, 丁紅發(fā), 劉海, 等. 基于本地差分隱私的分布式圖統(tǒng)計采集算法[J]. 計算機研究與發(fā)展, 2024, 61(7): 1643–1669. doi: 10.7544/issn1000-1239.202330628.FU Peiwang, DING Hongfa, LIU Hai, et al. Statistics collecting algorithms of distributed graph via local differential privacy[J]. Journal of Computer Research and Development, 2024, 61(7): 1643–1669. doi: 10.7544/issn1000-1239.202330628. [7] 李宗維, 孔德潮, 牛媛爭, 等. 基于人工智能和區(qū)塊鏈融合的隱私保護技術(shù)研究綜述[J]. 信息安全研究, 2023, 9(6): 557–565. doi: 10.12379/j.issn.2096-1057.2023.06.08.LI Zongwei, KONG Dechao, NIU Yuanzheng, et al. Towards a privacy-preserving research for AI and blockchain integration[J]. Journal of Information Security Research, 2023, 9(6): 557–565. doi: 10.12379/j.issn.2096-1057.2023.06.08. [8] LUO Yuling, WANG Zhangrui, ZHANG Shunsheng, et al. Efficient-secure k-means clustering guaranteeing personalized local differential privacy[C]. 22nd International Conference on Algorithms and Architectures for Parallel Processing, Copenhagen, Denmark, 2023: 660–675. doi: 10.1007/978-3-031-22677-9_35. [9] 張少波, 原劉杰, 毛新軍, 等. 基于本地差分隱私的K-modes聚類數(shù)據(jù)隱私保護方法[J]. 電子學(xué)報, 2022, 50(9): 2181–2188. doi: 10.12263/DZXB.20201374.ZHANG Shaobo, YUAN Liujie, MAO Xinjun, et al. Privacy protection method for K-modes clustering data with local differential privacy[J]. Acta Electronica Sinica, 2022, 50(9): 2181–2188. doi: 10.12263/DZXB.20201374. [10] XIA Chang, HUA Jingyu, TONG Wei, et al. Distributed K-Means clustering guaranteeing local differential privacy[J]. Computers & Security, 2020, 90: 101699. doi: 10.1016/j.cose.2019.101699. [11] LIN Aixin and MA Xuebin. PU_Bpub: High-dimensional data release mechanism based on spectral clustering with local differential privacy[C]. 17th International Conference on Wireless Algorithms, Systems, and Applications, Dalian, China, 2022: 572–581. doi: 10.1007/978-3-031-19214-2_48. [12] 張國鵬, 陳學(xué)斌, 王豪石, 等. 面向本地差分隱私的K-Prototypes聚類方法[J]. 計算機應(yīng)用, 2022, 42(12): 3813–3821. doi: 10.11772/j.issn.1001-9081.2021101724.ZHANG Guopeng, CHEN Xuebin, WANG Haoshi, et al. K-Prototypes clustering method for local differential privacy[J]. Journal of Computer Applications, 2022, 42(12): 3813–3821. doi: 10.11772/j.issn.1001-9081.2021101724. [13] LI Weiqing, CHEN Hongyu, ZHAO Ruifeng, et al. A federated recommendation system based on local differential privacy clustering[C]. 2021 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI), Atlanta, USA, 2021: 364–369. doi: 10.1109/SWC50871.2021.00056. [14] LI Yong, SONG Xiao, TU Yuchun, et al. GAPBAS: Genetic algorithm-based privacy budget allocation strategy in differential privacy K-means clustering algorithm[J]. Computers & Security, 2024, 139: 103697. doi: 10.1016/j.cose.2023.103697. [15] LIU Chao, ZHI Zhaolong, ZHAO Weinan, et al. Research on local fingerprint image differential privacy protection method based on clustering algorithm and regression algorithm segmentation image[J]. IEEE Access, 2024, 12: 27127–27146. doi: 10.1109/ACCESS.2024.3363494. [16] 石江南, 彭長根, 譚偉杰. Spark框架下支持差分隱私保護的K-means++聚類方法[J]. 信息安全研究, 2024, 10(8): 712–718. doi: 10.12379/j.issn.2096-1057.2024.08.04.SHI Jiangnan, PENG Changgen, and TAN Weijie. K-means++ clustering method supporting differential privacy protection in spark framework[J]. Journal of Information Security Research, 2024, 10(8): 712–718. doi: 10.12379/j.issn.2096-1057.2024.08.04. [17] WU Fuyu, DU Mingjing, and ZHI Qiang. Density-based clustering with differential privacy[J]. Information Sciences, 2024, 681: 121211. doi: 10.1016/j.ins.2024.121211. [18] FANG Shuhui, WAN Xuejun, WANG Jun, et al. HiDS Data clustering algorithm based on differential privacy[C]. 2024 International Conference on Networking and Network Applications (NaNA), Yinchuan, China, 2024: 131–136. doi: 10.1109/NaNA63151.2024.00029. [19] DIAA A, HUMPHRIES T, and KERSCHBAUM F. FastLloyd: Federated, accurate, secure, and tunable k-means clustering with differential privacy[EB/OL]. https://arxiv.org/abs/2405.02437, 2024. [20] SONG Haina, HAN Xinyu, LV Jie, et al. MPLDS: An integration of CP-ABE and local differential privacy for achieving multiple privacy levels data sharing[J]. Peer-to-Peer Networking and Applications, 2022, 15(1): 369–385. doi: 10.1007/s12083-021-01238-8. [21] YANG Wenjun and AL-MASRI E. ULDP: A user-centric local differential privacy optimization method[C]. 2024 IEEE World AI IoT Congress (AIIoT), Seattle, USA, 2024: 0316–0322. doi: 10.1109/AIIoT61789.2024.10579023. [22] 曾卓, 汪成亮, 馬飛. 基于差分隱私的活動模式保護與時空軌跡發(fā)布方法[J]. 電子學(xué)報, 2023, 51(3): 552–563. doi: 10.12263/DZXB.20210631.ZENG Zhuo, WANG Chengliang, MA Fei. Differentially private activity pattern and spatial-temporal trajectory publication[J]. Acta Electronica Sinica, 2023, 51(3): 552–563. doi: 10.12263/DZXB.20210631. [23] HERNANDEZ-MATAMOROS A and KIKUCHI H. Comparative analysis of local differential privacy schemes in healthcare datasets[J]. Applied Sciences, 2024, 14(7): 2864. doi: 10.3390/app14072864. [24] DU Minxin, YUE Xiang, CHOW S S M, et al. Sanitizing sentence embeddings (and labels) for local differential privacy[C]. Proceedings of the ACM Web Conference 2023, New York, USA, 2023: 2349–2359. doi: 10.1145/3543507.3583512. [25] YANG Mengmeng, GUO Taolin, ZHU Tianqing, et al. Local differential privacy and its applications: A comprehensive survey[J]. Computer Standards & Interfaces, 2024, 89: 103827. doi: 10.1016/j.csi.2023.103827. [26] KRASNOV D, DAVIS D, MALOTT K, et al. Fuzzy C-means clustering: A review of applications in breast cancer detection[J]. Entropy, 2023, 25(7): 1021. doi: 10.3390/e25071021. [27] ALI N A, EL ABBASSI A, and BOUATTANE O. Performance evaluation of spatial fuzzy C-means clustering algorithm on GPU for image segmentation[J]. Multimedia Tools and Applications, 2023, 82(5): 6787–6805. doi: 10.1007/s11042-022-13635-z. [28] 徐久成, 侯欽臣, 瞿康林, 等. 面向時間序列的魯棒性半監(jiān)督模糊C均值聚類[J]. 計算機工程與應(yīng)用, 2023, 59(8): 73–80. doi: 10.3778/j.issn.1002-8331.2207-0445.XU Jiucheng, HOU Qinchen, QU Kanglin, et al. Robust semi-supervised fuzzy C-means clustering for time series[J]. Computer Engineering and Applications, 2023, 59(8): 73–80. doi: 10.3778/j.issn.1002-8331.2207-0445. [29] ARORA J, TUSHIR M, and DADHWAL S K. A new suppression-based possibilistic fuzzy c-means clustering algorithm[J]. EAI Endorsed Transactions on Scalable Information Systems, 2023, 10(3): e3. doi: 10.4108/eetsis.v10i3.2057. [30] FANG Siguo, HUANG Dong, CAI Xiaosha, et al. Efficient multi-view clustering via unified and discrete bipartite graph learning[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(8): 11436–11447. doi: 10.1109/TNNLS.2023.3261460. [31] JAEGER A and BANKS D. Cluster analysis: A modern statistical review[J]. WIREs Computational Statistics, 2023, 15(3): e1597. doi: 10.1002/wics.1597. [32] BESHARATNIA F, TALEBPOUR A, and ALIAKBARY S. An improved grey wolves optimization algorithm for dynamic community detection and data clustering[J]. Applied Artificial Intelligence, 2022, 36(1): 2012000. doi: 10.1080/08839514.2021.2012000. [33] ALEMAZKOOR N, TOOTKABONI M, NATEGHI R, et al. Smart-meter big data for load forecasting: An alternative approach to clustering[J]. IEEE Access, 2022, 10: 8377–8387. doi: 10.1109/ACCESS.2022.3142680. [34] ZHAO Wenhao, MA Jin, LIU Qiyuan, et al. Comparison and application of SOFM, fuzzy c-means and k-means clustering algorithms for natural soil environment regionalization in China[J]. Environmental Research, 2023, 216: 114519. doi: 10.1016/j.envres.2022.114519. [35] 葛麗娜, 陳圓圓, 王捷, 等. 改進的密度峰值聚類算法的差分隱私保護方案[J]. 鄭州大學(xué)學(xué)報: 工學(xué)版, 2023, 44(6): 19–24. doi: 10.13705/j.issn.1671-6833.2023.03.010.GE Lina, CHEN Yuanyuan, WANG Jie, et al. Differential privacy protection scheme of adaptive clustering by fast search and find of density peaks[J]. Journal of Zhengzhou University: Engineering Science, 2023, 44(6): 19–24. doi: 10.13705/j.issn.1671-6833.2023.03.010. [36] CHEN Hua, ZHOU Yuan, MEI Kehui, et al. A new density peak clustering algorithm with adaptive clustering center based on differential privacy[J]. IEEE Access, 2023, 11: 1418–1431. doi: 10.1109/ACCESS.2022.3233196. -