Multi-Stage Game-based Topology Deception Method Using Deep Reinforcement Learning
doi: 10.11999/JEIT240029 cstr: 32379.14.JEIT240029
1. Institute of Information Technology Research, Information Engineering University, Zhengzhou 450001, China
2. Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou 450001, China
Abstract: Aiming at the problem that current network topology deception methods make decisions only in the spatial dimension, without considering how to perform spatio-temporal multi-dimensional topology deception in cloud-native network environments, a multi-stage Flipit game topology deception method based on deep reinforcement learning is proposed to obfuscate reconnaissance attacks in cloud-native networks. First, the topology deception attack-defense model in cloud-native network environments is analyzed. Then, by introducing a discount factor and transition probabilities, a multi-stage network topology deception defense model based on the Flipit game is constructed. On the basis of analyzing the attack and defense strategies of the game, a topology deception generation method based on deep reinforcement learning is developed to solve for the topology deception defense strategy of the multi-stage game model. Finally, an experimental environment is built to verify that the proposed method can effectively model and analyze topology deception attack-defense scenarios in cloud-native networks, and that the proposed algorithm has clear advantages over other algorithms.

Keywords:
- Cloud-native network
- Topology deception
- Multi-stage Flipit game
- Deep reinforcement learning
- Deep Deterministic Policy Gradient (DDPG) algorithm
Algorithm 1  Optimal network topology deception method based on the DDPG algorithm
Input: the attacker's reconnaissance strategy $ \lambda_{k} $ in the current network environment and the cost $ C_{{\mathrm{D}}}^{k} $ incurred by the defender's deception-switching strategy
Output: the defender's optimal network topology deception strategy $ T_{k} $
1 Initialize the experience replay pool $D \leftarrow \varnothing $, the Actor network parameters $ \theta $, and the Critic network parameters $ \varphi $
2 Initialize the Target Actor and Target Critic network parameters, i.e., $ \theta^{\prime} \leftarrow \theta $, $ \varphi^{\prime} \leftarrow \varphi $
3 for epi = 1, 2, ···, M do   // iterate until the neural networks converge
4   Initialize the environment state $ {\boldsymbol{s}}_{0} $ and the random noise $ {\boldsymbol{n}} $
5   Initialize the state-action trajectory $\tau \leftarrow \varnothing $
6   for t = 1, 2, ···, K do   // collect experience replay data and train the networks on the stored data
7     Obtain the current state $ {\boldsymbol{s}}_{t} $   // the strategy $ \lambda_{k} $ adopted by the attacker in the network environment at stage $k$
8     Output the action $ {\boldsymbol{a}}_{t} $ from the Actor network   // select the defender's topology deception strategy $ T $
9     Apply the action ${{\boldsymbol{a}}_t}$ plus the noise ${{\boldsymbol{n}}_t}$ to the environment, obtaining the reward ${r_t}$ and the next state ${{\boldsymbol{s}}_{t + 1}}$
10    Append the resulting transition to the trajectory $ \tau \leftarrow \tau \cup\left({\boldsymbol{s}}_{t}, {\boldsymbol{a}}_{t}, r_{t}, {\boldsymbol{s}}_{t+1}\right) $
11  end for
12  $ D \leftarrow D \cup \tau $
13  Sample a batch of transitions $ \left({\boldsymbol{s}}_{t}, {\boldsymbol{a}}_{t}, r_{t},{\boldsymbol{s}}_{t+1}\right) $ from the experience replay pool for training
14  Update the Actor network parameters $ \theta $ and the Critic network parameters $ \varphi $ according to Eq. (4) and Eq. (6)
15  Update the Target Actor and Target Critic network parameters according to Eq. (7)
16 end for
17 end
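To make the training loop in Algorithm 1 concrete, the following is a minimal Python/PyTorch sketch of a DDPG agent with a one-dimensional state (the attacker's reconnaissance rate $ \lambda_{k} $) and a one-dimensional action (the defender's deception period $ T $). The environment step, reward shape, network sizes, and all hyperparameters other than the learning rate are illustrative assumptions, not the paper's DTG-DDPG implementation.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA, TAU, LR = 0.99, 0.005, 2.5e-4      # discount factor, soft-update rate, learning rate (Table 2)
T_MIN, T_MAX = 1.0, 100.0                 # assumed range of the defender's strategy T

def scale_action(a):
    """Map the actor's tanh output in [-1, 1] to a deception period T in [T_MIN, T_MAX]."""
    return T_MIN + (a + 1.0) * 0.5 * (T_MAX - T_MIN)

actor = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)  # theta'<-theta, phi'<-phi
actor_opt = torch.optim.Adam(actor.parameters(), lr=LR)
critic_opt = torch.optim.Adam(critic.parameters(), lr=LR)
replay = deque(maxlen=100_000)            # experience replay pool D

def env_step(state, period):
    """Placeholder environment: returns (reward, next_state).
    The real method would evaluate the defender's game payoff for period T against the
    attacker's rate lambda_k; this dummy reward only keeps the sketch runnable."""
    reward = -abs(period - 1.0 / max(state, 1e-3))   # hypothetical stand-in payoff
    next_state = random.uniform(0.01, 0.20)          # next attacker rate lambda_{k+1}
    return reward, next_state

for epi in range(200):                    # outer loop over training episodes
    state = random.uniform(0.01, 0.20)    # initial attacker strategy lambda_0
    for t in range(50):                   # inner loop: collect experience
        with torch.no_grad():
            a = actor(torch.tensor([[state]])).item()
        a = max(-1.0, min(1.0, a + random.gauss(0.0, 0.1)))   # exploration noise n_t
        reward, next_state = env_step(state, scale_action(a))
        replay.append((state, a, reward, next_state))
        state = next_state

    if len(replay) < 64:
        continue
    batch = random.sample(replay, 64)     # sample transitions (s_t, a_t, r_t, s_{t+1})
    s, a, r, s2 = (torch.tensor(x, dtype=torch.float32).unsqueeze(1) for x in zip(*batch))

    with torch.no_grad():                 # TD target built from the target networks
        y = r + GAMMA * target_critic(torch.cat([s2, target_actor(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # deterministic policy gradient
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for tgt, src in ((target_actor, actor), (target_critic, critic)):   # soft target update
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```

In the full method, `env_step` would be replaced by the multi-stage Flipit game environment, with the reward derived from the defender's payoff and the Actor/Critic updates corresponding to Eq. (4), (6) and (7) in the paper.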
Table 1  Transition probabilities between the 8 stages
Stage transition | Transition probability
$ {\boldsymbol{S}}_{1} \rightarrow {\boldsymbol{S}}_{0}^{2} $ | $ \eta(2 \mid 1)=0.7 $
$ {\boldsymbol{S}}_{1} \rightarrow {\boldsymbol{S}}_{0}^{3} $ | $ \eta(3 \mid 1)=0.2 $
$ {\boldsymbol{S}}_{1} \rightarrow {\boldsymbol{S}}_{0}^{6} $ | $ \eta(6 \mid 1)=0.1 $
$ {\boldsymbol{S}}_{2} \rightarrow {\boldsymbol{S}}_{0}^{3} $ | $ \eta(3 \mid 2)=0.7 $
$ {\boldsymbol{S}}_{2} \rightarrow {\boldsymbol{S}}_{0}^{8} $ | $ \eta(8 \mid 2)=0.3 $
$ {\boldsymbol{S}}_{3} \rightarrow {\boldsymbol{S}}_{0}^{4} $ | $ \eta(4 \mid 3)=0.6 $
$ {\boldsymbol{S}}_{3} \rightarrow {\boldsymbol{S}}_{0}^{5} $ | $ \eta(5 \mid 3)=0.2 $
$ {\boldsymbol{S}}_{3} \rightarrow {\boldsymbol{S}}_{0}^{8} $ | $ \eta(8 \mid 3)=0.2 $
$ {\boldsymbol{S}}_{4} \rightarrow {\boldsymbol{S}}_{0}^{7} $ | $ \eta(7 \mid 4)=0.2 $
$ {\boldsymbol{S}}_{4} \rightarrow {\boldsymbol{S}}_{0}^{2} $ | $ \eta(2 \mid 4)=0.4 $
$ {\boldsymbol{S}}_{4} \rightarrow {\boldsymbol{S}}_{0}^{1} $ | $ \eta(1 \mid 4)=0.4 $
$ {\boldsymbol{S}}_{5} \rightarrow {\boldsymbol{S}}_{0}^{7} $ | $ \eta(7 \mid 5)=0.9 $
$ {\boldsymbol{S}}_{6} \rightarrow {\boldsymbol{S}}_{0}^{1} $ | $ \eta(1 \mid 6)=0.2 $
$ {\boldsymbol{S}}_{6} \rightarrow {\boldsymbol{S}}_{0}^{3} $ | $ \eta(3 \mid 6)=0.8 $
$ {\boldsymbol{S}}_{6} \rightarrow {\boldsymbol{S}}_{0}^{7} $ | $ \eta(7 \mid 6)=0.8 $
$ {\boldsymbol{S}}_{7} \rightarrow {\boldsymbol{S}}_{0}^{4} $ | $ \eta(4 \mid 7)=0.6 $
$ {\boldsymbol{S}}_{8} \rightarrow {\boldsymbol{S}}_{0}^{2} $ | $ \eta(2 \mid 8)=0.9 $
$ {\boldsymbol{S}}_{8} \rightarrow {\boldsymbol{S}}_{0}^{4} $ | $ \eta(4 \mid 8)=0.8 $
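A small illustration of how the stage-transition probabilities $ \eta(j \mid i) $ in Table 1 could be stored and sampled in Python; the dictionary layout and the helper `next_stage` are hypothetical, not identifiers from the paper. `random.choices` normalises the weights, which also covers any row of Table 1 whose listed probabilities do not sum exactly to 1.

```python
import random

# eta[i][j] = probability of jumping from stage i to stage j, as listed in Table 1
eta = {
    1: {2: 0.7, 3: 0.2, 6: 0.1},
    2: {3: 0.7, 8: 0.3},
    3: {4: 0.6, 5: 0.2, 8: 0.2},
    4: {7: 0.2, 2: 0.4, 1: 0.4},
    5: {7: 0.9},
    6: {1: 0.2, 3: 0.8, 7: 0.8},
    7: {4: 0.6},
    8: {2: 0.9, 4: 0.8},
}

def next_stage(current):
    """Draw the next stage according to eta(. | current)."""
    targets, weights = zip(*eta[current].items())
    return random.choices(targets, weights=weights, k=1)[0]

print(next_stage(1))   # e.g. 2 with probability 0.7
```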
Table 2  Simulation parameter settings
Experimental parameter | Value
Attacker's strategy $ {\lambda} $ | [0.01, 0.20]
Defender's strategy $ T $ | [1, 100]
Defender's cost $ C_{{\mathrm{D}}} $ | 4
Learning rate | $2.5 \times {10^{ - 4}}$
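For convenience, the settings in Table 2 can be gathered into a single configuration object and passed to the training loop sketched above; the key names below are illustrative assumptions, not identifiers from the paper.

```python
# Simulation settings from Table 2 (key names are hypothetical)
sim_config = {
    "attacker_strategy_range": (0.01, 0.20),  # attacker's reconnaissance rate lambda
    "defender_strategy_range": (1, 100),      # defender's deception period T
    "defender_cost": 4,                       # defender's cost C_D per topology change
    "learning_rate": 2.5e-4,                  # DDPG learning rate
}
```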
Table 3  Qualitative comparison between DTG-DDPG and other methods
Reference | Game model | Solution method | Attack-defense process | Decision objective | Real-time decision | Experimental scenario
Sayed et al. [7] | Dynamic game | Reinforcement learning (Q-learning) | Multi-stage | Single-objective spatial strategy | Not considered | NetworkX
Horák et al. [8] | Stochastic game | None | Multi-stage | Single-objective spatial strategy | Not considered | Traditional network
Milani et al. [9] | Stackelberg game | Neural architecture search | Single-stage | Single-objective spatial strategy | Not considered | Traditional network
Wang et al. [10] | Markov decision process | Reinforcement learning (Q-learning) | Multi-stage | Multi-objective spatio-temporal strategy | Considered | Traditional network
Li et al. [11] | Markov decision process | Deep reinforcement learning (PPO) | Multi-stage | Multi-objective spatio-temporal strategy | Considered | Cloud computing network
DTG-DDPG | Flipit game | Deep reinforcement learning (DDPG) | Multi-stage | Multi-objective spatio-temporal strategy | Considered | Cloud-native network
References
[1] DUAN Qiang. Intelligent and autonomous management in cloud-native future networks—A survey on related standards from an architectural perspective[J]. Future Internet, 2021, 13(2): 42. doi: 10.3390/fi13020042.
[2] ARMITAGE J. Cloud Native Security Cookbook[M]. O'Reilly Media, Inc., 2022: 15–20.
[3] TÄRNEBERG W, SKARIN P, GEHRMANN C, et al. Prototyping intrusion detection in an industrial cloud-native digital twin[C]. 2021 22nd IEEE International Conference on Industrial Technology, Valencia, Spain, 2021: 749–755. doi: 10.1109/ICIT46573.2021.9453553.
[4] STOJANOVIĆ B, HOFER-SCHMITZ K, and KLEB U. APT datasets and attack modeling for automated detection methods: A review[J]. Computers & Security, 2020, 92: 101734. doi: 10.1016/j.cose.2020.101734.
[5] TRASSARE S T, BEVERLY R, and ALDERSON D. A technique for network topology deception[C]. 2013 IEEE Military Communications Conference, San Diego, USA, 2013: 1795–1800. doi: 10.1109/MILCOM.2013.303.
[6] MEIER R, TSANKOV P, LENDERS V, et al. NetHide: Secure and practical network topology obfuscation[C]. 27th USENIX Conference on Security Symposium, Baltimore, USA, 2018: 693–709.
[7] SAYED A, ANWAR A H, KIEKINTVELD C, et al. Honeypot allocation for cyber deception in dynamic tactical networks: A game theoretic approach[C]. 14th International Conference on Decision and Game Theory for Security, Avignon, France, 2023: 195–214. doi: 10.1007/978-3-031-50670-3_10.
[8] HORÁK K, ZHU Quanyan, and BOŠANSKÝ B. Manipulating adversary's belief: A dynamic game approach to deception by design for proactive network security[C]. 8th International Conference on Decision and Game Theory for Security, Vienna, Austria, 2017: 273–294. doi: 10.1007/978-3-319-68711-7_15.
[9] MILANI S, SHEN Weiran, CHAN K S, et al. Harnessing the power of deception in attack graph-based security games[C]. 11th International Conference on Decision and Game Theory for Security, College Park, USA, 2020: 147–167. doi: 10.1007/978-3-030-64793-3_8.
[10] WANG Shuo, PEI Qingqi, WANG Jianhua, et al. An intelligent deployment policy for deception resources based on reinforcement learning[J]. IEEE Access, 2020, 8: 35792–35804. doi: 10.1109/ACCESS.2020.2974786.
[11] LI Huanruo, GUO Yunfei, HUO Shumin, et al. Defensive deception framework against reconnaissance attacks in the cloud with deep reinforcement learning[J]. Science China Information Sciences, 2022, 65(7): 170305. doi: 10.1007/s11432-021-3462-4.
[12] KANG M S, GLIGOR V D, and SEKAR V. SPIFFY: Inducing cost-detectability tradeoffs for persistent link-flooding attacks[C]. 23rd Annual Network and Distributed System Security Symposium, San Diego, USA, 2016: 53–55.
[13] KIM J, NAM J, LEE S, et al. BottleNet: Hiding network bottlenecks using SDN-based topology deception[J]. IEEE Transactions on Information Forensics and Security, 2021, 16: 3138–3153. doi: 10.1109/TIFS.2021.3075845.
[14] VAN DIJK M, JUELS A, OPREA A, et al. FlipIt: The game of “stealthy takeover”[J]. Journal of Cryptology, 2013, 26(4): 655–713. doi: 10.1007/s00145-012-9134-5.
[15] DORASZELSKI U and ESCOBAR J F. A theory of regular Markov perfect equilibria in dynamic stochastic games: Genericity, stability, and purification[J]. Theoretical Economics, 2010, 5(3): 369–402. doi: 10.3982/TE632.
[16] NILIM A and GHAOUI L E. Robust control of Markov decision processes with uncertain transition matrices[J]. Operations Research, 2005, 53(5): 780–798. doi: 10.1287/opre.1050.0216.
[17] ZHANG Yong, TAN Xiaobin, CUI Xiaolin, et al. Network security situation awareness approach based on Markov game model[J]. Journal of Software, 2011, 22(3): 495–508. doi: 10.3724/SP.J.1001.2011.03751. (in Chinese)
[18] China national vulnerability database of information security[DB/OL]. https://www.cnnvd.org.cn/home/aboutUs, 2015.