Multi-Stage Game-based Topology Deception Method Using Deep Reinforcement Learning
doi: 10.11999/JEIT240029 cstr: 32379.14.JEIT240029
1. Institute of Information Technology Research, Information Engineering University, Zhengzhou 450001, China
2. Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou 450001, China
Abstract: Aiming at the problem that current network topology deception methods make decisions only in the spatial dimension, without considering how to perform spatio-temporal multi-dimensional topology deception in cloud-native network environments, a multi-stage Flipit game topology deception method based on deep reinforcement learning is proposed to obfuscate reconnaissance attacks in cloud-native networks. First, the topology deception attack-defense model in cloud-native network environments is analyzed. Then, by introducing a discount factor and transition probabilities, a multi-stage network topology deception defense model based on the Flipit game is constructed. On the basis of analyzing the attack and defense strategies of the game, a topology deception generation method based on deep reinforcement learning is developed to solve for the topology deception defense strategy of the multi-stage game model. Finally, an experimental environment is built to verify that the proposed method can effectively model and analyze topology deception attack-defense scenarios in cloud-native networks, and that the proposed algorithm has clear advantages over other algorithms.

Keywords:
- Cloud-native network
- Topology deception
- Multi-stage Flipit game
- Deep reinforcement learning
- Deep Deterministic Policy Gradient (DDPG) algorithm
Algorithm 1  Optimal network topology deception method based on the DDPG algorithm
Input: the attacker's reconnaissance strategy $ \lambda_{k} $ in the current network environment and the cost $ C_{{\mathrm{D}}}^{k} $ incurred by the defender's deception-switching strategy
Output: the defender's optimal network topology deception strategy $ T_{k} $
1 Initialize the experience replay pool $D \leftarrow \varnothing $, the Actor network parameters $ \theta $, and the Critic network parameters $ \varphi $
2 Initialize the Target Actor and Target Critic network parameters, i.e., $ \theta^{\prime} \leftarrow \theta $, $ \varphi^{\prime} \leftarrow \varphi $
3 for epi = 1, 2, ···, M do   // iterate until the neural networks converge
4   Initialize the environment state $ {\boldsymbol{s}}_{0} $ and the random noise $ {\boldsymbol{n}} $
5   Initialize the state-action trajectory $\tau \leftarrow \varnothing $
6   for t = 1, 2, ···, K do   // collect experience replay data and train the networks on the stored data
7     Obtain the current state $ {\boldsymbol{s}}_{t} $   // the strategy $ \lambda_{k} $ adopted by the attacker in the network environment at stage $k$
8     Output the action $ {\boldsymbol{a}}_{t} $ from the Actor network   // select the defender's topology deception strategy $ T $
9     Apply the action ${{\boldsymbol{a}}_t}$ plus the noise ${{\boldsymbol{n}}_t}$ to the environment, obtaining the reward ${r_t}$ and the next state ${{\boldsymbol{s}}_{t + 1}}$
10    Append the resulting transition to the trajectory $ \tau \leftarrow \tau \cup\left({\boldsymbol{s}}_{t}, {\boldsymbol{a}}_{t}, r_{t}, {\boldsymbol{s}}_{t+1}\right) $
11  end for
12  $ D \leftarrow D \cup \tau $
13  Sample a batch of transitions $ \left({\boldsymbol{s}}_{t}, {\boldsymbol{a}}_{t}, r_{t},{\boldsymbol{s}}_{t+1}\right) $ from the experience replay pool for training
14  Update the Actor network parameters $ \theta $ and the Critic network parameters $ \varphi $ according to Eq. (4) and Eq. (6)
15  Update the Target Actor and Target Critic network parameters according to Eq. (7)
16 end for
17 end
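To make the training loop in Algorithm 1 concrete, the following is a minimal Python/PyTorch sketch of a DDPG agent with a one-dimensional state (the attacker's reconnaissance rate $ \lambda_{k} $) and a one-dimensional action (the defender's deception period $ T $). The environment step, reward shape, network sizes, and all hyperparameters other than the learning rate are illustrative assumptions, not the paper's DTG-DDPG implementation.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA, TAU, LR = 0.99, 0.005, 2.5e-4      # discount factor, soft-update rate, learning rate (Table 2)
T_MIN, T_MAX = 1.0, 100.0                 # assumed range of the defender's strategy T

def scale_action(a):
    """Map the actor's tanh output in [-1, 1] to a deception period T in [T_MIN, T_MAX]."""
    return T_MIN + (a + 1.0) * 0.5 * (T_MAX - T_MIN)

actor = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)  # theta'<-theta, phi'<-phi
actor_opt = torch.optim.Adam(actor.parameters(), lr=LR)
critic_opt = torch.optim.Adam(critic.parameters(), lr=LR)
replay = deque(maxlen=100_000)            # experience replay pool D

def env_step(state, period):
    """Placeholder environment: returns (reward, next_state).
    The real method would evaluate the defender's game payoff for period T against the
    attacker's rate lambda_k; this dummy reward only keeps the sketch runnable."""
    reward = -abs(period - 1.0 / max(state, 1e-3))   # hypothetical stand-in payoff
    next_state = random.uniform(0.01, 0.20)          # next attacker rate lambda_{k+1}
    return reward, next_state

for epi in range(200):                    # outer loop over training episodes
    state = random.uniform(0.01, 0.20)    # initial attacker strategy lambda_0
    for t in range(50):                   # inner loop: collect experience
        with torch.no_grad():
            a = actor(torch.tensor([[state]])).item()
        a = max(-1.0, min(1.0, a + random.gauss(0.0, 0.1)))   # exploration noise n_t
        reward, next_state = env_step(state, scale_action(a))
        replay.append((state, a, reward, next_state))
        state = next_state

    if len(replay) < 64:
        continue
    batch = random.sample(replay, 64)     # sample transitions (s_t, a_t, r_t, s_{t+1})
    s, a, r, s2 = (torch.tensor(x, dtype=torch.float32).unsqueeze(1) for x in zip(*batch))

    with torch.no_grad():                 # TD target built from the target networks
        y = r + GAMMA * target_critic(torch.cat([s2, target_actor(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # deterministic policy gradient
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for tgt, src in ((target_actor, actor), (target_critic, critic)):   # soft target update
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```

In the full method, `env_step` would be replaced by the multi-stage Flipit game environment, with the reward derived from the defender's payoff and the Actor/Critic updates corresponding to Eq. (4), (6) and (7) in the paper.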
Table 1  Transition probabilities between the 8 stages
Stage transition | Transition probability
$ {\boldsymbol{S}}_{1} \rightarrow {\boldsymbol{S}}_{0}^{2} $ | $ \eta(2 \mid 1)=0.7 $
$ {\boldsymbol{S}}_{1} \rightarrow {\boldsymbol{S}}_{0}^{3} $ | $ \eta(3 \mid 1)=0.2 $
$ {\boldsymbol{S}}_{1} \rightarrow {\boldsymbol{S}}_{0}^{6} $ | $ \eta(6 \mid 1)=0.1 $
$ {\boldsymbol{S}}_{2} \rightarrow {\boldsymbol{S}}_{0}^{3} $ | $ \eta(3 \mid 2)=0.7 $
$ {\boldsymbol{S}}_{2} \rightarrow {\boldsymbol{S}}_{0}^{8} $ | $ \eta(8 \mid 2)=0.3 $
$ {\boldsymbol{S}}_{3} \rightarrow {\boldsymbol{S}}_{0}^{4} $ | $ \eta(4 \mid 3)=0.6 $
$ {\boldsymbol{S}}_{3} \rightarrow {\boldsymbol{S}}_{0}^{5} $ | $ \eta(5 \mid 3)=0.2 $
$ {\boldsymbol{S}}_{3} \rightarrow {\boldsymbol{S}}_{0}^{8} $ | $ \eta(8 \mid 3)=0.2 $
$ {\boldsymbol{S}}_{4} \rightarrow {\boldsymbol{S}}_{0}^{7} $ | $ \eta(7 \mid 4)=0.2 $
$ {\boldsymbol{S}}_{4} \rightarrow {\boldsymbol{S}}_{0}^{2} $ | $ \eta(2 \mid 4)=0.4 $
$ {\boldsymbol{S}}_{4} \rightarrow {\boldsymbol{S}}_{0}^{1} $ | $ \eta(1 \mid 4)=0.4 $
$ {\boldsymbol{S}}_{5} \rightarrow {\boldsymbol{S}}_{0}^{7} $ | $ \eta(7 \mid 5)=0.9 $
$ {\boldsymbol{S}}_{6} \rightarrow {\boldsymbol{S}}_{0}^{1} $ | $ \eta(1 \mid 6)=0.2 $
$ {\boldsymbol{S}}_{6} \rightarrow {\boldsymbol{S}}_{0}^{3} $ | $ \eta(3 \mid 6)=0.8 $
$ {\boldsymbol{S}}_{6} \rightarrow {\boldsymbol{S}}_{0}^{7} $ | $ \eta(7 \mid 6)=0.8 $
$ {\boldsymbol{S}}_{7} \rightarrow {\boldsymbol{S}}_{0}^{4} $ | $ \eta(4 \mid 7)=0.6 $
$ {\boldsymbol{S}}_{8} \rightarrow {\boldsymbol{S}}_{0}^{2} $ | $ \eta(2 \mid 8)=0.9 $
$ {\boldsymbol{S}}_{8} \rightarrow {\boldsymbol{S}}_{0}^{4} $ | $ \eta(4 \mid 8)=0.8 $
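A small illustration of how the stage-transition probabilities $ \eta(j \mid i) $ in Table 1 could be stored and sampled in Python; the dictionary layout and the helper `next_stage` are hypothetical, not identifiers from the paper. `random.choices` normalises the weights, which also covers any row of Table 1 whose listed probabilities do not sum exactly to 1.

```python
import random

# eta[i][j] = probability of jumping from stage i to stage j, as listed in Table 1
eta = {
    1: {2: 0.7, 3: 0.2, 6: 0.1},
    2: {3: 0.7, 8: 0.3},
    3: {4: 0.6, 5: 0.2, 8: 0.2},
    4: {7: 0.2, 2: 0.4, 1: 0.4},
    5: {7: 0.9},
    6: {1: 0.2, 3: 0.8, 7: 0.8},
    7: {4: 0.6},
    8: {2: 0.9, 4: 0.8},
}

def next_stage(current):
    """Draw the next stage according to eta(. | current)."""
    targets, weights = zip(*eta[current].items())
    return random.choices(targets, weights=weights, k=1)[0]

print(next_stage(1))   # e.g. 2 with probability 0.7
```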
Table 2  Simulation parameter settings
Experimental parameter | Value
Attacker's strategy $ {\lambda} $ | [0.01, 0.20]
Defender's strategy $ T $ | [1, 100]
Defender's cost $ C_{{\mathrm{D}}} $ | 4
Learning rate | $2.5 \times {10^{ - 4}}$
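For convenience, the settings in Table 2 can be gathered into a single configuration object and passed to the training loop sketched above; the key names below are illustrative assumptions, not identifiers from the paper.

```python
# Simulation settings from Table 2 (key names are hypothetical)
sim_config = {
    "attacker_strategy_range": (0.01, 0.20),  # attacker's reconnaissance rate lambda
    "defender_strategy_range": (1, 100),      # defender's deception period T
    "defender_cost": 4,                       # defender's cost C_D per topology change
    "learning_rate": 2.5e-4,                  # DDPG learning rate
}
```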
Table 3  Qualitative comparison between DTG-DDPG and other methods
Reference | Game model | Solution method | Attack-defense process | Decision objective | Real-time decision | Experimental scenario
Sayed et al. [7] | Dynamic game | Reinforcement learning (Q-learning) | Multi-stage | Single-objective spatial strategy | Not considered | NetworkX
Horák et al. [8] | Stochastic game | None | Multi-stage | Single-objective spatial strategy | Not considered | Traditional network
Milani et al. [9] | Stackelberg game | Neural architecture search | Single-stage | Single-objective spatial strategy | Not considered | Traditional network
Wang et al. [10] | Markov decision process | Reinforcement learning (Q-learning) | Multi-stage | Multi-objective spatio-temporal strategy | Considered | Traditional network
Li et al. [11] | Markov decision process | Deep reinforcement learning (PPO) | Multi-stage | Multi-objective spatio-temporal strategy | Considered | Cloud computing network
DTG-DDPG | Flipit game | Deep reinforcement learning (DDPG) | Multi-stage | Multi-objective spatio-temporal strategy | Considered | Cloud-native network
References
[1] DUAN Qiang. Intelligent and autonomous management in cloud-native future networks—A survey on related standards from an architectural perspective[J]. Future Internet, 2021, 13(2): 42. doi: 10.3390/fi13020042.
[2] ARMITAGE J. Cloud Native Security Cookbook[M]. O'Reilly Media, Inc., 2022: 15–20.
[3] TÄRNEBERG W, SKARIN P, GEHRMANN C, et al. Prototyping intrusion detection in an industrial cloud-native digital twin[C]. 2021 22nd IEEE International Conference on Industrial Technology, Valencia, Spain, 2021: 749–755. doi: 10.1109/ICIT46573.2021.9453553.
[4] STOJANOVIĆ B, HOFER-SCHMITZ K, and KLEB U. APT datasets and attack modeling for automated detection methods: A review[J]. Computers & Security, 2020, 92: 101734. doi: 10.1016/j.cose.2020.101734.
[5] TRASSARE S T, BEVERLY R, and ALDERSON D. A technique for network topology deception[C]. 2013 IEEE Military Communications Conference, San Diego, USA, 2013: 1795–1800. doi: 10.1109/MILCOM.2013.303.
[6] MEIER R, TSANKOV P, LENDERS V, et al. NetHide: Secure and practical network topology obfuscation[C]. 27th USENIX Conference on Security Symposium, Baltimore, USA, 2018: 693–709.
[7] SAYED A, ANWAR A H, KIEKINTVELD C, et al. Honeypot allocation for cyber deception in dynamic tactical networks: A game theoretic approach[C]. 14th International Conference on Decision and Game Theory for Security, Avignon, France, 2023: 195–214. doi: 10.1007/978-3-031-50670-3_10.
[8] HORÁK K, ZHU Quanyan, and BOŠANSKÝ B. Manipulating adversary's belief: A dynamic game approach to deception by design for proactive network security[C]. 8th International Conference on Decision and Game Theory for Security, Vienna, Austria, 2017: 273–294. doi: 10.1007/978-3-319-68711-7_15.
[9] MILANI S, SHEN Weiran, CHAN K S, et al. Harnessing the power of deception in attack graph-based security games[C]. 11th International Conference on Decision and Game Theory for Security, College Park, USA, 2020: 147–167. doi: 10.1007/978-3-030-64793-3_8.
[10] WANG Shuo, PEI Qingqi, WANG Jianhua, et al. An intelligent deployment policy for deception resources based on reinforcement learning[J]. IEEE Access, 2020, 8: 35792–35804. doi: 10.1109/ACCESS.2020.2974786.
[11] LI Huanruo, GUO Yunfei, HUO Shumin, et al. Defensive deception framework against reconnaissance attacks in the cloud with deep reinforcement learning[J]. Science China Information Sciences, 2022, 65(7): 170305. doi: 10.1007/s11432-021-3462-4.
[12] KANG M S, GLIGOR V D, and SEKAR V. SPIFFY: Inducing cost-detectability tradeoffs for persistent link-flooding attacks[C]. 23rd Annual Network and Distributed System Security Symposium, San Diego, USA, 2016: 53–55.
[13] KIM J, NAM J, LEE S, et al. BottleNet: Hiding network bottlenecks using SDN-based topology deception[J]. IEEE Transactions on Information Forensics and Security, 2021, 16: 3138–3153. doi: 10.1109/TIFS.2021.3075845.
[14] VAN DIJK M, JUELS A, OPREA A, et al. FlipIt: The game of “stealthy takeover”[J]. Journal of Cryptology, 2013, 26(4): 655–713. doi: 10.1007/s00145-012-9134-5.
[15] DORASZELSKI U and ESCOBAR J F. A theory of regular Markov perfect equilibria in dynamic stochastic games: Genericity, stability, and purification[J]. Theoretical Economics, 2010, 5(3): 369–402. doi: 10.3982/TE632.
[16] NILIM A and GHAOUI L E. Robust control of Markov decision processes with uncertain transition matrices[J]. Operations Research, 2005, 53(5): 780–798. doi: 10.1287/opre.1050.0216.
[17] ZHANG Yong, TAN Xiaobin, CUI Xiaolin, et al. Network security situation awareness approach based on Markov game model[J]. Journal of Software, 2011, 22(3): 495–508. doi: 10.3724/SP.J.1001.2011.03751. (in Chinese)
[18] China national vulnerability database of information security[DB/OL]. https://www.cnnvd.org.cn/home/aboutUs, 2015.