A System-Level Exploration and Evaluation Simulator for Chiplet-Based CPUs
doi: 10.11999/JEIT240299 cstr: 32379.14.JEIT240299
1. State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
3. Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450003, China
Abstract: As Moore's Law slows, improving the chip manufacturing process becomes increasingly difficult, and further performance gains run into the "area wall". Chiplet technology has been widely adopted to address this problem. However, the architectural design parameters that chiplets introduce pose new challenges for existing computer architecture simulators. To explore a specific chiplet design parameter, prior work typically adds a single feature to a simulator, which makes it hard to study the combined effect of multiple parameters on a chiplet-based chip. To support a more comprehensive exploration and evaluation of chiplet architectures, this paper develops the System-level Exploration and Evaluation simulator for Chiplet (SEEChiplet), a framework built on the gem5 simulator. First, the three classes of design parameters that current chiplet designs focus on are summarized: (1) cache system design; (2) packaging; (3) the inter-chiplet interconnection network. Second, for these three classes: (1) a private Last Level Cache (LLC) system is designed and implemented, expanding the cache system design space; (2) the existing gem5 global directory is modified to work with the private LLC system; (3) two common chiplet packaging methods and the inter-chiplet interconnection network are modeled. Finally, a system-level evaluation is carried out in SEEChiplet: an operating system and the PARSEC 3.0 benchmark suite are run on the simulated chiplet-based general-purpose processor, which verifies the functionality of SEEChiplet and shows that it can explore and evaluate the chiplet design space.
Key words:
- Chiplet
- Design space exploration
- Computer architecture simulator
- Cache system
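Table 1 Design space of many-core chiplet architectures
| Design option | Parameters |
|---|---|
| Chiplet itself | Processor: instruction set architecture; in-order or out-of-order execution; number of cores |
| | Cache system: cache block size; cache capacity; cache hierarchy; chiplet-private LLC or globally shared LLC, etc. |
| | Number of chiplets |
| Chiplet interconnect architecture | Chiplet topology: Mesh, IO-die, etc.; routing algorithm |
| | Chiplet interconnect: link bandwidth; link latency; router latency |
| | Chiplet integration: MCM, 2.5D, 3D; SERDES configuration between a chiplet and the package substrate or interposer, etc. |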
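The options in Table 1 are largely independent, so the design space grows as their cross-product. The following is a minimal sketch of enumerating a small slice of it. The dictionary keys are hypothetical names chosen for illustration, not SEEChiplet parameters, and only a few values per option are taken from the table.

```python
from itertools import product

# Illustrative slice of the Table 1 design space.
# Keys are hypothetical names, not actual SEEChiplet options.
design_space = {
    "core_type": ["in-order", "out-of-order"],
    "llc_organization": ["chiplet-private", "globally-shared"],
    "topology": ["Mesh", "IO-die"],
    "integration": ["MCM", "2.5D", "3D"],
}


def enumerate_designs(space):
    """Yield one dict per point in the cross-product of all options."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


designs = list(enumerate_designs(design_space))
print(len(designs))  # 2 * 2 * 2 * 3 design points
```

Even this small slice gives 24 candidate configurations, which suggests why a simulator that varies only one parameter at a time covers little of the space.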
Table 2 Existing chiplet research work
| Work | Simulator or simulation method | Research focus | Chiplet packaging | Supports running an operating system | Open source |
|---|---|---|---|---|---|
| Meduza[16] | PriME[24] | Chiplet cache system | 2.5D | No | No |
| Ref. [17] | gem5[25] | Chiplet cache system | 2.5D | No | No |
| Ref. [18] | Multi2Sim[26] | Chiplet cache system | Wireless | No | No |
| Ref. [19] | gem5-X[27] | Chiplet cache system | Wireless | Yes | No |
| 1-Update[20] | SimFlex[28] | Chiplet cache system | 3D | No | Yes |
| SILO[21] | Not stated | Chiplet cache system | 3D | No | No |
| Kite[22] | gem5 | Chiplet topology | 2D, 2.5D | Yes | No |
| HexaMesh[23] | BookSim[29] | Chiplet topology | 2.5D | No | No |
| Ref. [30] | Swarm[31] | Chiplet architecture performance | 2D, 2.5D | No | No |
| Ref. [32] | gem5[25] | Chiplet architecture simulation | None | Yes | Yes |
| DCRA[33] | muchiSim[34] | Chiplet architecture simulation | 2D, 2.5D | No | Yes |
| Ref. [35] | FPGA | Chiplet architecture simulation | 2D | Yes | No |
| SMAPPIC[36] | FPGA | Chiplet architecture simulation | None | Yes | Yes |
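Table 4 SEEChiplet simulation parameter configuration
| Item | Basic information |
|---|---|
| CPU | Timing CPU, x86 instruction set, 3 GHz |
| Cache hierarchy and parameters (capacity and similar parameters are user-configurable) | 3-level cache, inclusive, same frequency as the CPU |
| | L1: instruction cache and data cache; one pair per CPU core; each 32 kB, 4-way set associative |
| | L2: one per CPU core; 1 MB, 8-way set associative |
| | L3: shared by all chiplets or shared within a chiplet; 32 MB, 16-way set associative |
| Packaging | MCM, 2.5D: the SERDES component adds 2 cycles; the router itself takes 3 cycles |
| Chiplet topology | IO Die and Mesh architectures supported |
| Chiplet parameters | Each chiplet can have 2, 4, 8, or 16 cores |
| Memory | Single-channel DDR4, 8 GB, 2400 MT/s |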
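As a quick sanity check on the cache parameters in Table 4, the number of sets at each level follows from capacity divided by associativity times block size. The sketch below assumes 64-byte cache blocks, which Table 4 does not state. The class and variable names are hypothetical and are not taken from SEEChiplet or gem5.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * 1024


@dataclass(frozen=True)
class CacheLevel:
    size_bytes: int
    associativity: int

    def num_sets(self, block_bytes: int = 64) -> int:
        # 64-byte blocks are an assumption; Table 4 gives no block size.
        return self.size_bytes // (self.associativity * block_bytes)


# Capacities and associativities as listed in Table 4.
TABLE4_CACHES = {
    "L1I": CacheLevel(32 * KIB, 4),
    "L1D": CacheLevel(32 * KIB, 4),
    "L2": CacheLevel(1 * MIB, 8),
    "L3": CacheLevel(32 * MIB, 16),
}

for name, level in TABLE4_CACHES.items():
    print(name, level.num_sets())
```

Changing `block_bytes` or the capacities shows how the user-configurable parameters mentioned in the table alter the cache geometry.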
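Table 5 Distribution of intra-chiplet and inter-chiplet requests under different LLC organizations
| LLC organization | Internal requests (%) | External requests (%) | Total requests |
|---|---|---|---|
| Chiplet-private LLC | 78.4 | 21.6 | 458037 |
| Globally shared LLC | 5.8 | 94.2 | 404340 |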
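The percentages in Table 5 can be turned back into approximate request counts to compare how much traffic leaves the chiplet under each organization. This is a small arithmetic sketch. Because the percentages are rounded to one decimal place, the resulting counts are estimates, and the dictionary layout is chosen here only for illustration.

```python
# Percentages and totals copied from Table 5.
table5 = {
    "chiplet-private LLC": {"external_pct": 21.6, "total": 458037},
    "globally shared LLC": {"external_pct": 94.2, "total": 404340},
}


def external_requests(row):
    """Estimate the number of requests that leave the chiplet."""
    return round(row["total"] * row["external_pct"] / 100)


counts = {name: external_requests(row) for name, row in table5.items()}
ratio = counts["chiplet-private LLC"] / counts["globally shared LLC"]

for name, count in counts.items():
    print(f"{name}: ~{count} external requests")
print(f"private / shared external traffic: {ratio:.2f}")
```

Under these figures, the chiplet-private organization issues roughly a quarter as many inter-chiplet requests as the globally shared one, even though its total request count is somewhat higher.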
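Table 6 Summary of SEEChiplet modeling overhead
| Source of overhead | Summary |
|---|---|
| Chiplet-private LLC | Code size: ~1000 lines |
| | New transient states: 12 |
| | New event types: 11 |
| | New state transitions: 30, covering interaction with the global directory and other LLCs |
| | New virtual channels: 2, used to interact with the global directory |
| Global directory | Code size: ~600 lines |
| | Added bits per entry: 64 bits for the list of sharer chiplets, 8 bits for the owner chiplet ID |
| | New states: 1 base state (S) and 9 transient states |
| | Modified states: the M state and its handling logic |
| | New event types: 10 |
| | New state transitions: 18, covering request forwarding and request responses by the global directory, etc. |
| | New virtual channels: 2, used to interact with the LLCs |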
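The per-entry directory metadata in Table 6 (64 bits for the sharer list, 8 bits for the owner ID) can be pictured with a small model. The sketch below assumes the sharer list is a bit vector with one bit per chiplet. The table gives only the bit widths, so this encoding and the class itself are hypothetical and do not reflect the actual SEEChiplet or gem5 data structures.

```python
SHARER_BITS = 64  # Table 6: bits for the list of sharer chiplets
OWNER_BITS = 8    # Table 6: bits for the owner chiplet ID


class DirectoryEntry:
    """Hypothetical model of the added per-entry directory metadata."""

    def __init__(self):
        self.sharers = 0   # assumed encoding: bit i set => chiplet i shares
        self.owner = None  # chiplet ID of the current owner, if any

    @staticmethod
    def _check(chiplet_id, limit):
        if not 0 <= chiplet_id < limit:
            raise ValueError(f"chiplet id {chiplet_id} out of range")

    def add_sharer(self, chiplet_id):
        self._check(chiplet_id, SHARER_BITS)
        self.sharers |= 1 << chiplet_id

    def remove_sharer(self, chiplet_id):
        self._check(chiplet_id, SHARER_BITS)
        self.sharers &= ~(1 << chiplet_id)

    def set_owner(self, chiplet_id):
        self._check(chiplet_id, 1 << OWNER_BITS)
        self.owner = chiplet_id

    def sharer_list(self):
        return [i for i in range(SHARER_BITS) if (self.sharers >> i) & 1]


entry = DirectoryEntry()
for cid in (0, 5, 63):
    entry.add_sharer(cid)
entry.remove_sharer(5)
entry.set_owner(63)
extra_bits = SHARER_BITS + OWNER_BITS  # added storage per directory entry
print(entry.sharer_list(), entry.owner, extra_bits)
```

With this encoding, a 64-bit vector would track up to 64 chiplets, and the added storage comes to 72 bits per directory entry.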
[1] MOORE G E. Cramming more components onto integrated circuits[J]. Electronics, 1965, 38(8): 114–117.
[2] DENNARD R H, GAENSSLEN F H, YU H N, et al. Design of ion-implanted MOSFET's with very small physical dimensions[J]. IEEE Journal of Solid-State Circuits, 1974, 9(5): 256–268. doi: 10.1109/JSSC.1974.1050511.
[3] HAN Yinhe, XU Haobo, LU Meixuan, et al. The big chip: Challenge, model and architecture[J]. Fundamental Research, 2023, S2667325823003709. doi: 10.1016/j.fmre.2023.10.020.
[4] CAI Jingwei, WU Zuotong, PENG Sen, et al. Gemini: Mapping and architecture co-exploration for large-scale DNN Chiplet accelerators[C]. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Edinburgh, United Kingdom, 2024: 156–171. doi: 10.1109/HPCA57654.2024.00022.
[5] CHEN Yunji, CAI Yimao, WANG Yu, et al. Integrated circuit technology: Future development and key issues–review of the 347th "Shuangqing Forum (Youth)"[J]. Scientia Sinica Informationis, 2024, 54(1): 1–15. doi: 10.1360/SSI-2023-0356.
[6] XIANG Shaolin, GUO Mao, PU Bo, et al. Overview of the development status of Chiplet technology[J]. Science & Technology Review, 2023, 41(19): 113–131. doi: 10.3981/j.issn.1000-7857.2023.19.013.
[7] LI Jiayao, ZHANG Kun, and PAN Quan. Chiplet: Expanding the innovative boundaries of chip design[J]. Integrated Circuits and Embedded Systems, 2024, 24(2): 1–9.
[8] MA Xiaohan, WANG Ying, WANG Yujie, et al. Survey on Chiplets: Interface, interconnect and integration methodology[J]. CCF Transactions on High Performance Computing, 2022, 4(1): 43–52. doi: 10.1007/s42514-022-00093-0.
[9] SUGGS D, SUBRAMONY M, and BOUVIER D. The AMD "Zen 2" processor[J]. IEEE Micro, 2020, 40(2): 45–52. doi: 10.1109/MM.2020.2974217.
[10] NAFFZIGER S, LEPAK K, PARASCHOU M, et al. 2.2 AMD Chiplet architecture for high-performance server and desktop products[C]. 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, USA, 2020: 44–45. doi: 10.1109/ISSCC19947.2020.9063103.
[11] EVERS M, BARNES L, and CLARK M. The AMD next-generation "Zen 3" Core[J]. IEEE Micro, 2022, 42(3): 7–12. doi: 10.1109/MM.2022.3152788.
[12] MUNGER B, WILCOX K, SNIDERMAN J, et al. Zen 4: The AMD 5nm 5.7GHz x86-64 microprocessor core[C]. 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, USA, 2023: 38–39. doi: 10.1109/ISSCC42615.2023.10067540.
[13] GIANOS C. Architecting for flexibility and value with next gen Intel® Xeon® processors[C]. 2023 IEEE Hot Chips 35 Symposium (HCS), Palo Alto, USA, 2023: 1–15. doi: 10.1109/HCS59251.2023.10254694.
[14] ESPOSITO B. Intel Agilex® 9 direct RF-series FPGAs with integrated 64 Gsps data converters[C]. 2023 IEEE Hot Chips 35 Symposium (HCS), Palo Alto, USA, 2023: 1–35. doi: 10.1109/HCS59251.2023.10254707.
[15] VENTANA MICRO. Veyron V1 data center-class RISC-V processor[C]. 2023 IEEE Hot Chips 35 Symposium (HCS), Palo Alto, USA, 2023: 1–16. doi: 10.1109/HCS59251.2023.10254710.
[16] CHIRKOV G and WENTZLAFF D. Seizing the bandwidth scaling of on-package interconnect in a post-Moore's law world[C]. Proceedings of the 37th International Conference on Supercomputing, Orlando, USA, 2023: 410–422. doi: 10.1145/3577193.3593702.
[17] YANG Chongyi, ZHANG Zhendong, WANG Xiaohang, et al. Adaptive caching policies for Chiplet systems based on reinforcement learning[C]. 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, USA, 2023: 1–5. doi: 10.1109/ISCAS46773.2023.10181966.
[18] GADE S H, SINHA M, KUMAR M, et al. Scalable hybrid cache coherence using emerging links for Chiplet architectures[C]. 2022 35th International Conference on VLSI Design and 2022 21st International Conference on Embedded Systems (VLSID), Bangalore, India, 2022: 92–97. doi: 10.1109/VLSID2022.2022.00029.
[19] MEDINA R, KEIN J, ANSALONI G, et al. System-level exploration of in-package wireless communication for multi-Chiplet platforms[C]. Proceedings of the 28th Asia and South Pacific Design Automation Conference, Tokyo, Japan, 2023: 561–566. doi: 10.1145/3566097.3567952.
[20] ZHU Mingcan, SHAHAB A, KATSARAKIS A, et al. Invalidate or update? Revisiting coherence for tomorrow's cache hierarchies[C]. 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), Atlanta, USA, 2021: 226–241. doi: 10.1109/PACT52795.2021.00024.
[21] SHAHAB A, ZHU Mingcan, MARGARITOV A, et al. Farewell my shared LLC! A case for private die-stacked DRAM caches for servers[C]. 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Fukuoka, Japan, 2018: 559–572. doi: 10.1109/MICRO.2018.00052.
[22] BHARADWAJ S, YIN Jieming, BECKMANN B, et al. Kite: A family of heterogeneous interposer topologies enabled via accurate interconnect modeling[C]. 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, USA, 2020: 1–6. doi: 10.1109/DAC18072.2020.9218539.
[23] IFF P, BESTA M, CAVALCANTE M, et al. HexaMesh: Scaling to hundreds of Chiplets with an optimized Chiplet arrangement[C]. 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, USA, 2023: 1–6. doi: 10.1109/DAC56929.2023.10248006.
[24] FU Yaosheng and WENTZLAFF D. PriME: A parallel and distributed simulator for thousand-core chips[C]. 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, USA, 2014: 116–125. doi: 10.1109/ISPASS.2014.6844467.
[25] LOWE-POWER J, AHMAD A M, AKRAM A, et al. The gem5 simulator: Version 20.0+[EB/OL]. https://arxiv.org/abs/2007.03152, 2020.
[26] UBAL R, JANG B, MISTRY P, et al. Multi2Sim: A simulation framework for CPU-GPU computing[C]. Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, Minneapolis, USA, 2012: 335–344. doi: 10.1145/2370816.2370865.
[27] QURESHI Y M, SIMON W A, ZAPATER M, et al. gem5-X: A many-core heterogeneous simulation platform for architectural exploration and optimization[J]. ACM Transactions on Architecture and Code Optimization (TACO), 2021, 18(4): 44. doi: 10.1145/3461662.
[28] HARDAVELLAS N, SOMOGYI S, WENISCH T F, et al. SimFlex: A fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture[J]. ACM SIGMETRICS Performance Evaluation Review, 2004, 31(4): 31–34. doi: 10.1145/1054907.1054914.
[29] JIANG Nan, BECKER U D, MICHELOGIANNAKIS G, et al. A detailed and flexible cycle-accurate network-on-chip simulator[C]. 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, USA, 2013: 86–96. doi: 10.1109/ISPASS.2013.6557149.
[30] BRKIĆ I R and JEFFREY M C M. Disintegrating manycores: Which applications lose and why?[C]. Proceedings of the 16th International Workshop on Network on Chip Architectures, Toronto, Canada, 2023: 3–8. doi: 10.1145/3610396.3618090.
[31] JEFFREY M C, SUBRAMANIAN S, YAN Cong, et al. A scalable architecture for ordered parallelism[C]. 2015 48th International Symposium on Microarchitecture (MICRO), Waikiki, USA, 2015: 228–241. doi: 10.1145/2830772.2830777.
[32] ZHI Haocong, XU Xianuo, HAN Weijian, et al. A methodology for simulating multi-Chiplet systems using open-source simulators[C]. Proceedings of the Eight Annual ACM International Conference on Nanoscale Computing and Communication, New York, NY, USA, 2021: 18. doi: 10.1145/3477206.3477459.
[33] ORENES-VERA M, TURECI E, MARTONOSI M, et al. DCRA: A distributed Chiplet-based reconfigurable architecture for irregular applications[EB/OL]. https://arxiv.org/abs/2311.15443, 2024.
[34] ORENES-VERA M, TURECI E, MARTONOSI M, et al. MuchiSim: A simulation framework for design exploration of multi-chip Manycore systems[C]. Proceedings of the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Indianapolis, USA, 2024: 48–60. doi: 10.1109/ISPASS61541.2024.00015.
[35] LI Xingyu. High-performance FPGA-accelerated Chiplet modeling[D]. [Master dissertation], University of California, Berkeley, 2022.
[36] CHIRKOV G and WENTZLAFF D. SMAPPIC: Scalable multi-FPGA architecture prototype platform in the cloud[C]. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, Canada, 2023: 733–746. doi: 10.1145/3575693.3575753.
[37] ZHAN Xusheng, BAO Yungang, BIENIA C, et al. PARSEC3.0: A multicore benchmark suite with network stacks and SPLASH-2X[J]. ACM SIGARCH Computer Architecture News, 2017, 44(5): 1–16. doi: 10.1145/3053277.3053279.
[38] HARDAVELLAS N, FERDMAN M, FALSAFI B, et al. Reactive NUCA: Near-optimal block placement and replication in distributed caches[J]. ACM SIGARCH Computer Architecture News, 2009, 37(3): 184–195. doi: 10.1145/1555815.1555779.
[39] AWASTHI M, SUDAN K, BALASUBRAMONIAN R, et al. Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches[C]. 2009 IEEE 15th International Symposium on High Performance Computer Architecture, Raleigh, USA, 2009: 250–261. doi: 10.1109/HPCA.2009.4798260.
[40] KIM C, BURGER D, and KECKLER S W. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches[C]. Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, USA, 2002: 211–222. doi: 10.1145/605397.605420.
[41] LI Chengeng, JIANG Fan, CHEN Shixi, et al. Accelerating cache coherence in Manycore processor through silicon photonic Chiplet[C]. Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design (ICCAD'22), San Diego, USA, 2022: 43. doi: 10.1145/3508352.3549338.
[42] CUBERO-CASCANTE J, ZURSTRAßEN N, NÖLLER J, et al. Parti-gem5: Gem5's timing mode parallelised[C]. 23rd International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, 2023: 177–192. doi: 10.1007/978-3-031-46077-7_12.