魔方派:面向全同態(tài)加密的存算模運算加速器設(shè)計

李冰; 劉懷駿; 張偉功

doi:10.11999/JEIT230349

魔方派:面向全同態(tài)加密的存算模運算加速器設(shè)計

doi: 10.11999/JEIT230349 cstr: 32379.14.JEIT230349

1.
首都師范大學交叉科學研究院北京 100048
2.
首都師范大學信息工程學院北京 100048

基金項目: 國家自然基金項目(62204164)

詳細信息

作者簡介:
李冰：女，博士，副研究員，研究方向為計算機系統(tǒng)結(jié)構(gòu)、存內(nèi)計算架構(gòu)芯片設(shè)計

劉懷駿：男，碩士生，研究方向為計算機體系結(jié)構(gòu)、存內(nèi)計算、憶阻器

張偉功：男，博士，研究員，研究方向為計算機體系結(jié)構(gòu)、高可靠嵌入式系統(tǒng)設(shè)計

通訊作者:
李冰　bing.li@cnu.edu.cn

中圖分類號: TN401; TP391
計量
- 文章訪問數(shù): 859
- HTML全文瀏覽量: 364
- PDF下載量: 132
- 被引次數(shù): 0
出版歷程
- 收稿日期: 2023-04-27
- 修回日期: 2023-08-18
- 網(wǎng)絡(luò)出版日期: 2023-08-24
- 刊出日期: 2023-09-27

M²PI: Processing-in-Memory Modular Computing Accelerator for Full Homomorphic Encryption

1.
Academy for Multidisciplinary Studies, Capital Normal University, Beijing 100048, China
2.
College of Information Engineering, Capital Normal University, Beijing 100048, China

Funds: The National Natural Science Foundation of China (62204164)

摘要

摘要: 全同態(tài)加密(FHE)由于其可以實現(xiàn)隱私數(shù)據(jù)的計算，大大提高了數(shù)據(jù)的安全性而在醫(yī)療診斷、云計算、機器學習等領(lǐng)域取得了廣泛的關(guān)注。但是全同態(tài)密碼高昂的計算代價阻礙了其廣泛應(yīng)用。即使經(jīng)過算法和軟件設(shè)計優(yōu)化，F(xiàn)HE全同態(tài)加密中一個整數(shù)明文的密文數(shù)據(jù)規(guī)?？梢赃_到56 MByte，端側(cè)生成的密鑰最大都會達到11 k Byte。密文以及密鑰數(shù)據(jù)規(guī)模過大引起嚴重的計算和訪存瓶頸。存內(nèi)計算(PIM)是一個解決該問題的有效方案，其完全消除了內(nèi)存墻的延遲和功耗問題，在端側(cè)計算大規(guī)模數(shù)據(jù)時更具優(yōu)勢。利用存內(nèi)計算加速全同態(tài)計算的工作已經(jīng)被廣泛研究，但是全同態(tài)加密端側(cè)的執(zhí)行過程由于耗時的模運算也面臨著執(zhí)行時間的瓶頸。該文分析了BFV方案加密、解密、密鑰生成操作中各個關(guān)鍵算子的計算開銷，發(fā)現(xiàn)模計算的計算開銷平均占比達到了41%，延遲占比中訪存占97%，因此，該文提出一個名為魔方派(M²PI)的基于靜態(tài)隨機存取存儲器(SRAM)存內(nèi)計算的模運算加速器設(shè)計。實驗結(jié)果表明，該文所提加速器相比CPU中模計算有1.77倍的計算速度提升以及32.76倍能量的節(jié)省。
- 全同態(tài)加密 /
- 存內(nèi)計算 /
- 模運算
Abstract: Fully Homomorphic Encryption (FHE) attracts emerging interests from the fields of medical diagnosis, cloud computing, machine learning, etc. because it can realize the calculation on encrypted data and improve significantly the security of private data in the cloud computing scenarios. However, the expensive computational cost of FHE prevents its wide application. Even after algorithm and software design optimization, the ciphertext data size of an integer plaintext in FHE reaches 56 MByte, and the secret key data size reaches 11 k Byte. The large size of ciphertext and key causes serious bottlenecks in computation and memory access. Processing-In-Memory (PIM) is an effective solution to this problem, which eliminates completely the efficiency and power problem of the memory wall, and enables the deployment of data-intensive of application to the edge side. The application of processing-in memory to accelerate fully homomorphic computing has been widely studied, but the execution of homomorphic encryption still faces the execution time bottleneck induced by time-consuming modular computing. The computational costs of various key operators in BFV encryption, decryption, and key generation operations are analyzed in this paper, and found that the average computational cost of modular computing reached 41%, with memory access accounting for 97%. A modular accelerator called Processing-In-Memory Modular(M²PI) based on Static Random-Access Memory(SRAM) array is proposed to optimize modular computing in full-homomorphic encryption. The experimental results show that the proposed work achieves 1.77 times speedup and 32.76 times energy saving compared to CPU.
- Fully Homomorphic Encryption(FHE) /
- Processing-In-Memory(PIM) /
- Modular computing

HTML全文

圖 1 全同態(tài)加密完整過程

下載: 全尺寸圖片幻燈片

圖 2 不同參數(shù)規(guī)模下各階段算子執(zhí)行時間對比

下載: 全尺寸圖片幻燈片

圖 3 優(yōu)化前后模計算延遲對比

下載: 全尺寸圖片幻燈片

圖 4 巴雷特模運算中計算和訪存時間對比

下載: 全尺寸圖片幻燈片

圖 5 用于模計算的 PIM 陣列

下載: 全尺寸圖片幻燈片

圖 6 按位邏輯計算

下載: 全尺寸圖片幻燈片

圖 7 多比特邏輯運算

下載: 全尺寸圖片幻燈片

圖 8 PIM 中實現(xiàn)計算

下載: 全尺寸圖片幻燈片

圖 10 優(yōu)化前后各計算過程執(zhí)行速度對比

下載: 全尺寸圖片幻燈片

圖 9 PIM 中執(zhí)行模運算與之前結(jié)果對比

下載: 全尺寸圖片幻燈片

圖 11 優(yōu)化前后 4 個計算過程執(zhí)行能量節(jié)省

下載: 全尺寸圖片幻燈片

算法 1　巴雷特模算法
計算：$r = c{\text{ mod }}q$ (q為n 位數(shù)，2ⁿ)
準備：模數(shù)q, 被模數(shù)c
(1) $mu = {2^{2n}}/q$
(2) ${t_1} = c > > n - 1$
(3) ${t_2} = {t_1} \times mu$
(4) ${t_3} = {t_2} > > n + 1$
(5) $ {r_1} = c{\text{ }}\% {\text{ }}{2^{n + 1}} $
(6) ${r_2} = ({t_3} \times q){\text{ \% }}{{\text{2}}^{n + 1}}$
(7) $r = {r_1} - {r_2}$
判斷：$r > q/2?(r - q):r$
返回 r

下載: 導出CSV

算法 2 PIM中執(zhí)行巴雷特模算法
計算：$r = c{\text{ mod }}q$ (q為n bit數(shù))
(1) $x = c > > n - 1$
(2) $a = x < < n - 1$
(3) $r = c - a$
判斷：$r > q/2?(r - q):r$
返回 r

下載: 導出CSV

算法 3 PIM 中實現(xiàn)減法
輸入：c(被減數(shù))，$na$(${{a}}$對應(yīng)的補碼)
輸出：${\text{Sub}}$(為最終減法結(jié)果)
(1) $t = (((c + na)' + c)' + ((c + na)' + na)')'$
(2) $b = ((t + b)' + (c + na)')'$
(3) ${\text{Sub}} = ((t + (t + b)')' + (b + (t + b)')')'$

下載: 導出CSV

表 1 實驗參數(shù)設(shè)置^[11,21]

參數(shù)	值
晶體管工藝	65 nm
單元類型	10T-SRAM
電源電壓(Vdd)	1.1 V
閾值電壓(V_th)	166.3941e-3 V
讀取能量	293.3 fJ
寫入能量	316.3 fJ
陣列大小	128×512
時鐘頻率	100 MHz

下載: 導出CSV

參考文獻(21)

[1]	RIVEST R L, SHAMIR A, and ADLEMAN L. A method for obtaining digital signatures and public-key cryptosystems[J]. Communications of the ACM, 1978, 21(2): 120–126. doi: 10.1145/359340.359342
[2]	GENTRY C. Fully homomorphic encryption using ideal lattices[C]. The Forty-First Annual ACM Symposium on Theory of Computing, Bethesda, USA, 2009: 169–178.
[3]	FAN Junfeng and VERCAUTEREN F. Somewhat practical fully homomorphic encryption[R]. Paper 2012/144, 2012.
[4]	MARCOLLA C, SUCASAS V, MANZANO M, et al. Survey on fully homomorphic encryption, theory, and applications[J]. Proceedings of the IEEE, 2022, 110(10): 1572–1609. doi: 10.1109/JPROC.2022.3205665
[5]	KIM S, KIM J, KIM M J, et al. BTS: An accelerator for bootstrappable fully homomorphic encryption[C]. The 49th Annual International Symposium on Computer Architecture, New York, USA, 2022: 711–725.
[6]	NEJATOLLAHI H, GUPTA S, IMANI M, et al. CryptoPIM: In-memory acceleration for lattice-based cryptographic hardware[C]. 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, USA, 2020: 1–6.
[7]	SAMARDZIC N, FELDMANN A, KRASTEV A, et al. F1: A fast and programmable accelerator for fully homomorphic encryption[C/OL]. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Greece, 2021: 238–252.
[8]	AHN J, HONG S, YOO S, et al. A scalable processing-in-memory accelerator for parallel graph processing[C]. The 42nd Annual International Symposium on Computer Architecture, Portland, USA, 2015: 105–117.
[9]	AGA S, JELOKA S, SUBRAMANIYAN A, et al. Compute caches[C]. 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, USA, 2017: 481–492.
[10]	CAO Zhengjun, WEI Ruizhong, and LIN Xiaodong. A fast modular reduction method[R]. Paper 2014/040, 2014.
[11]	CHEN Paiyu, PENG Xiaochen, and YU Shimeng. NeuroSim+: An integrated device-to-algorithm framework for benchmarking synaptic devices and array architectures[C]. 2017 IEEE International Electron Devices Meeting (IEDM), San Francisco, USA, 2017: 6.1. 1–6.1. 4.
[12]	HE Yuquan, QU Songyun, LIN Gangliang, et al. Processing-in-SRAM acceleration for ultra-low power visual 3D perception[C]. The 59th ACM/IEEE Design Automation Conference, San Francisco, USA, 2022: 295–300.
[13]	LI Dai, PAKALA A, and YANG Kaiyuan. MeNTT: A compact and efficient processing-in-memory number theoretic transform (NTT) accelerator[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022, 30(5): 579–588. doi: 10.1109/TVLSI.2022.3151321
[14]	REIS D, TAKESHITA J, JUNG T, et al. Computing-in-memory for performance and energy-efficient homomorphic encryption[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2020, 28(11): 2300–2313. doi: 10.1109/TVLSI.2020.3017595
[15]	GUPTA S, CAMMAROTA R, and ROSING T Š. MemFHE: End-to-end computing with fully homomorphic encryption in memory[J]. ACM Transactions on Embedded Computing Systems, 2022, 30(5): 579−588.
[16]	Sarojaerabelli. py-FHE - A python library for fully homomorphic encryption[EB/OL]. https://github.com/sarojaerabelli/py-fhe, 2022.
[17]	ALBRECHT M, CHASE M, CHEN Hao, et al. Homomorphic encryption standard[M]. LAUTER K, DAI Wei, and LAINE K. Protecting Privacy Through Homomorphic Encryption. Cham: Springer, 2021: 31–62.
[18]	AKYEL K C, CHARLES H P, MOTTIN J, et al. DRC²: Dynamically reconfigurable computing circuit based on memory architecture[C]. 2016 IEEE International Conference on Rebooting Computing (ICRC), San Diego, USA, 2016: 1–8.
[19]	HASAN M, SAHA U K, HOSSAIN M S, et al. Low power design of a two bit mangitude comparator for high speed operation[C]. 2019 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 2019: 1–4.
[20]	TALATI N, GUPTA S, MANE P, et al. Logic design within memristive memories using memristor-aided loGIC (MAGIC)[J]. IEEE Transactions on Nanotechnology, 2016, 15(4): 635–650. doi: 10.1109/TNANO.2016.2570248
[21]	YANG Y, JEONG H, SONG S C, et al. Single bit-line 7T SRAM cell for near-threshold voltage operation with enhanced performance and energy in 14 nm FinFET technology[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2016, 63(7): 1023–1032. doi: 10.1109/TCSI.2016.2556118