SMCA: A Framework for Scaling Chiplet-Based Computing-in-Memory Accelerators
doi: 10.11999/JEIT240284 cstr: 32379.14.JEIT240284
1. School of Computer and Information Technology (School of Big Data), Shanxi University, Taiyuan 030006, China
2. Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006, China
3. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
4. State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
5. University of Chinese Academy of Sciences, Beijing 100190, China
6. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Abstract: Computing-in-Memory (CiM) architectures based on Resistive Random Access Memory (ReRAM) have been recognized as a promising solution for accelerating deep learning applications. As intelligent applications continue to evolve, deep learning models grow ever larger, placing higher demands on the computational and storage resources of processing platforms. However, owing to the non-idealities of ReRAM devices, large-scale ReRAM-based computing chips face severe challenges of low yield and low reliability. Chiplet-based architectures assemble multiple small chiplets into a single package, improving fabrication yield and lowering manufacturing cost, and have become a major trend in chip design. However, compared with the on-chip wiring of monolithic chips, expensive inter-chiplet communication becomes the performance bottleneck of chiplet-based systems and limits their compute scalability. To address this, SMCA (SMT-based CiM chiplet Acceleration), a scaling framework for chiplet-based CiM accelerators, is proposed in this paper. The framework combines an adaptive partitioning strategy for deep learning tasks with automated workload deployment based on Satisfiability Modulo Theories (SMT) to generate energy-efficient, low-transmission-overhead workload schedules on chiplet-based deep learning accelerators, effectively improving system performance and energy efficiency. Experimental results show that, compared with existing strategies, the scheduling plans that SMCA automatically generates for deep learning tasks on integrated chips reduce inter-chiplet communication energy by 35%.
Key words:
- Chiplet /
- Deep learning processor /
- Computing-in-Memory (CiM) /
- Task dispatching
Algorithm 1 Adaptive layer-wise network partitioning strategy
1: Input: fixed compute capacity $M$ of a single chiplet; compute demands $w({w_0},{w_1}, \cdots ,{w_{L-1}})$ of the network layers $l({l_0},{l_1}, \cdots ,{l_{L-1}})$.
2: Output: network partitioning strategy bestP.
3: ${C_{\text{idle}}} = M$; /* initialize ${C_{\text{idle}}}$ */
4: for $i = 0,1, \cdots ,L-1$
5:  if ${C_{\text{idle}}} \ge {w_i}$ then
6:   ${\text{bestP}} \leftarrow {\text{NoPartition}}(i,{w_i})$;
7:  else if $\left\lceil \dfrac{w_i}{M} \right\rceil == \left\lceil \dfrac{w_i - C_{\text{idle}}}{M} \right\rceil$ then
8:   ${\text{bestP}} \leftarrow {\text{CMP}}(i,{w_i})$;
9:  else
10:   ${\text{bestP}} \leftarrow {\text{CAP}}(i,{w_i})$;
11:  Update(${C_{\text{idle}}}$)
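A minimal Python rendering of Algorithm 1 follows, to make the control flow concrete. NoPartition, CMP, and CAP are the paper's partition primitives and are recorded here only as decision labels; the bookkeeping inside the unspecified Update(${C_{\text{idle}}}$) step is our assumption (the idle capacity is reset to whatever remains on the last chiplet the layer occupies).

import math

def adaptive_partition(M, w):
    """Sketch of Algorithm 1: adaptive layer-wise network partitioning.

    M: fixed compute capacity of one chiplet.
    w: per-layer compute demands [w_0, ..., w_{L-1}].
    Returns bestP as a list of (layer index, decision) pairs.
    """
    best_p = []
    c_idle = M  # idle capacity remaining on the current chiplet
    for i, wi in enumerate(w):
        if c_idle >= wi:
            # Lines 5-6: the layer fits into the idle capacity as-is.
            best_p.append((i, "NoPartition"))
            c_idle -= wi
        elif math.ceil(wi / M) == math.ceil((wi - c_idle) / M):
            # Lines 7-8: filling the idle fragment first needs no more extra
            # chiplets than starting fresh, so apply the CMP primitive.
            best_p.append((i, "CMP"))
            rem = (wi - c_idle) % M       # load left on the last chiplet used
            c_idle = (M - rem) % M        # Update(C_idle), assumed semantics
        else:
            # Lines 9-10: otherwise apply the CAP primitive.
            best_p.append((i, "CAP"))
            rem = wi % M                  # assumes CAP skips the idle fragment
            c_idle = (M - rem) % M        # Update(C_idle), assumed semantics
    return best_p

# Example: capacity 100; the four layers exercise all three branches.
print(adaptive_partition(100, [60, 130, 180, 50]))
# [(0, 'NoPartition'), (1, 'CAP'), (2, 'CMP'), (3, 'NoPartition')]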
Table 1 Notation used in the SMT constraints
$T, E, C$: the set of computation tasks, the set of edges in the computation graph, and the set of chiplets in the package
$t, c$: computation task $t$; chiplet $c$
$e_{i,j}$: the directed edge from task $i$ to task $j$ in the computation graph
$x^c, y^c$: the $(x, y)$ coordinates of chiplet $c$ in the package
$w^t$: the compute demand of task $t$
$o^t$: the volume of intermediate data produced by task $t$
$s^t$: the start time of task $t$
$d^t$: the minimum inter-chiplet data transmission cost needed to complete all predecessor tasks of task $t$
$\tau^t$: the execution time of task $t$
$\mathrm{sw}^c$: the wavefront index of chiplet $c$
$\mathrm{dis}(c_i, c_j)$: the distance from chiplet $c_i$ to chiplet $c_j$
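To make the notation concrete, the toy sketch below encodes a fragment of such a formulation with the Z3 SMT solver's Python bindings (the z3-solver package). It is illustrative only, not the paper's full constraint system: each task is given the $(x^c, y^c)$ coordinates of its host chiplet, no two tasks share a chiplet as a stand-in for capacity, $\mathrm{dis}$ is taken as Manhattan distance, and the objective charges each edge $e_{i,j}$ the producer's data volume $o^i$; start times $s^t$, execution times $\tau^t$, and wavefronts $\mathrm{sw}^c$ are omitted.

from z3 import Int, Optimize, And, Or, If, sat

def dist(a, b):
    # |a - b| for integer Z3 terms
    return If(a >= b, a - b, b - a)

tasks = range(4)
edges = [(0, 1), (1, 2), (2, 3)]     # e_{i,j}: task i feeds task j
o = {0: 64, 1: 128, 2: 64, 3: 32}    # o^t: intermediate data volume (e.g. KB)

opt = Optimize()
x = {t: Int(f"x_{t}") for t in tasks}  # x-coordinate of the chiplet hosting t
y = {t: Int(f"y_{t}") for t in tasks}  # y-coordinate of the chiplet hosting t

for t in tasks:
    opt.add(And(0 <= x[t], x[t] < 2, 0 <= y[t], y[t] < 2))  # stay on a 2x2 mesh

# Capacity stand-in: no two tasks may share a chiplet.
for i in tasks:
    for j in tasks:
        if i < j:
            opt.add(Or(x[i] != x[j], y[i] != y[j]))

# Minimize total inter-chiplet traffic: sum over edges of o^i * dis(c_i, c_j).
traffic = sum(o[i] * (dist(x[i], x[j]) + dist(y[i], y[j])) for i, j in edges)
opt.minimize(traffic)

if opt.check() == sat:
    m = opt.model()
    print([(t, m[x[t]].as_long(), m[y[t]].as_long()) for t in tasks])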
Table 2 System configuration
Package: frequency 1.8 GHz; inter-chiplet interconnect bandwidth 100 Gb/s; inter-chiplet communication energy 1.75 pJ/bit
Chiplet: 16 nm process; 16 compute cores per chiplet; 16 ReRAM crossbar arrays per compute core
Compute core: ReRAM crossbar array size $128 \times 128$; ADC precision 1 bit; DAC precision 8 bit; 2 bits stored per ReRAM cell; weight precision 8 bit; weight-stationary dataflow
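As a back-of-envelope check on these numbers: one chiplet stores 16 cores x 16 crossbars x 128 x 128 cells x 2 bits = 8 Mb of weights, i.e. $2^{20}$ (about one million) 8-bit weights, and every bit crossing a chiplet boundary costs 1.75 pJ at 100 Gb/s. The small helper below (our own, for illustration) computes both quantities.

XBAR_DIM = 128               # ReRAM crossbar array is 128 x 128 cells
BITS_PER_CELL = 2
XBARS_PER_CORE = 16
CORES_PER_CHIPLET = 16
WEIGHT_BITS = 8
LINK_BW_BITS_PER_S = 100e9   # inter-chiplet bandwidth: 100 Gb/s
LINK_PJ_PER_BIT = 1.75       # inter-chiplet communication energy

def chiplet_weight_capacity():
    """Number of 8-bit weights one chiplet can hold in its crossbars."""
    bits = XBAR_DIM * XBAR_DIM * BITS_PER_CELL * XBARS_PER_CORE * CORES_PER_CHIPLET
    return bits // WEIGHT_BITS

def transfer_cost(bits):
    """Energy (microjoules) and latency (microseconds) of moving `bits`
    across one chiplet-to-chiplet link, assuming ideal link utilization."""
    energy_uj = bits * LINK_PJ_PER_BIT / 1e6
    latency_us = bits / LINK_BW_BITS_PER_S * 1e6
    return energy_uj, latency_us

print(chiplet_weight_capacity())    # 1048576 weights per chiplet
print(transfer_cost(8 * 2**20))     # a 1 MiB tensor: ~14.7 uJ, ~83.9 us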