Details

SD2AC: A reinforcement learning framework using distribution evaluation and sequential decision-making for UCAV combat (indexed in SCI-EXPANDED and EI)

Document type: Journal article

English title: SD2AC: A reinforcement learning framework using distribution evaluation and sequential decision-making for UCAV combat

Authors: Yang, Tao[1,2]; Shi, Xinhao[1,2]; Xu, Cheng[1]; Yang, Yulin[2]; Liu, Hongzhe[1]; Zeng, Qinghan[2]

First author: Yang, Tao

Corresponding author: Xu, C[1]

Affiliations: [1] Beijing Union Univ, Beijing Key Lab Informat Serv Engn, Beijing 100101, Peoples R China; [2] Sci & Technol Innovat Res Ctr ARI, Unit 32178 PLA, Beijing 100012, Peoples R China

First affiliation: Beijing Key Laboratory of Information Service Engineering, Beijing Union University

Corresponding affiliation: [1] Beijing Union Univ, Beijing Key Lab Informat Serv Engn, Beijing 100101, Peoples R China (Beijing Key Laboratory of Information Service Engineering, Beijing Union University)

Year: 2025

Volume: 12

Issue: 7

Pages: 96-112

Journal: JOURNAL OF COMPUTATIONAL DESIGN AND ENGINEERING

Indexed in: EI (accession number: 20252918815648); WOS: SCI-EXPANDED (accession number: WOS:001526759000001)

Funding: The authors would like to express their sincere gratitude to the Beijing Key Laboratory of Information Service Engineering and the Science and Technology Innovation Research Center of ARI, Unit 32178 of the PLA, for their continuous technical support and valuable resources during the preparation of this manuscript.

Language: English

Keywords: multi-agent reinforcement learning; multi-agent system; air combat; q-learning; soft actor-critic; UCAV

Abstract: The increasing complexity of autonomous drone swarm operations, particularly in adversarial air combat scenarios, presents significant challenges in multi-agent reinforcement learning (MARL). Existing approaches often face issues such as poor coordination, suboptimal convergence, and overestimation bias, undermining their applicability to high-stakes environments. To address these challenges, we propose the Sequential Decision Distribution Soft Actor-Critic (SD2AC), a novel off-policy MARL framework for drone swarm coordination. SD2AC introduces three key innovations: (1) a sequential decision-making framework in which agents adaptively condition their decisions on the policies of preceding agents, enhancing coordination and reducing conflicts; (2) a distributional Q-value critic that models the full return distribution to mitigate overestimation bias and improve policy robustness; and (3) an adaptive twin value distribution learning mechanism that leverages dual critics to dynamically select conservative value estimates, ensuring stable learning under uncertainty. Rigorous evaluation in diverse environments, including Close Air Combat, Multi-agent Encirclement with Collision Avoidance (MECA), and MAMuJoCo, demonstrates the superiority of SD2AC over state-of-the-art MARL methods in terms of reward optimization, convergence speed, and coordination efficiency. Ablation studies further validate the individual contributions of each component.
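Illustrative note: the abstract mentions dual critics that select conservative value estimates over return distributions. The sketch below shows one way such a selection could look. It is not taken from the paper. The Gaussian parameterisation, the "lower mean wins" rule, and the names `ReturnDist`, `conservative_select`, and `soft_target_mean` are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReturnDist:
    """Gaussian summary of a return distribution (assumed parameterisation)."""
    mean: float
    std: float


def conservative_select(d1: ReturnDist, d2: ReturnDist) -> ReturnDist:
    """Return the critic output with the lower expected return.

    Assumption: "conservative" means taking the smaller mean, in the
    spirit of clipped double Q-learning. The paper may use another rule.
    """
    return d1 if d1.mean <= d2.mean else d2


def soft_target_mean(reward: float, gamma: float, next_dist: ReturnDist,
                     alpha: float, next_log_prob: float, done: bool) -> float:
    """Mean of a soft Bellman target: r + gamma * (E[Z'] - alpha * log pi)."""
    if done:
        # No bootstrapping past a terminal state.
        return reward
    return reward + gamma * (next_dist.mean - alpha * next_log_prob)


# Two hypothetical critic outputs for the same next state-action pair.
critic_a = ReturnDist(mean=1.0, std=0.5)
critic_b = ReturnDist(mean=0.8, std=0.2)
chosen = conservative_select(critic_a, critic_b)
target = soft_target_mean(reward=1.0, gamma=0.9, next_dist=chosen,
                          alpha=0.2, next_log_prob=-1.0, done=False)
```

Here the selection picks `critic_b` because its mean is lower, and the target is bootstrapped from that distribution only.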

