
Details

基于多模态融合的城市道路场景视频描述模型研究    

Multimodal fusion for video captioning on urban road scene

Document type: Journal article

Chinese title: 基于多模态融合的城市道路场景视频描述模型研究

English title: Multimodal fusion for video captioning on urban road scene

Authors: Li Mingxing (李铭兴)[1,2]; Xu Cheng (徐成)[1,2]; Li Xuewei (李学伟)[1,2]; Liu Hongzhe (刘宏哲)[1,2]; Yan Chenyang (闫晨阳)[1,2]; Liao Wensen (廖文森)[1,2]

First author: Li Mingxing (李铭兴)

Affiliations: [1] Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China; [2] Beijing Laboratory of Brain and Cognitive Intelligence, Beijing Union University, Beijing 100101, China

First affiliation: Beijing Key Laboratory of Information Service Engineering, Beijing Union University

Year: 2023

Volume: 40

Issue: 2

Pages: 607-611

Chinese journal title: 计算机应用研究

English journal title: Application Research of Computers

Indexed in: CSTPCD; Peking University Core Journals (2020 edition); CSCD (Expanded edition, 2023-2024)

Funding: National Natural Science Foundation of China (62171042, 62102033, 61906017, 61802019); Beijing Municipal Key Science and Technology Project (KZ202211417048); Collaborative Innovation Center Project (CYXC2203); Academic Research Projects of Beijing Union University (BPHR2020DZ02, ZB10202003, ZK40202101, ZK120202104).

Language: Chinese

Keywords: video captioning; multimodal fusion; attention mechanism; intelligent driving

Abstract: Urban road video captioning typically considers only visual information and ignores the equally important audio information; multimodal fusion is one way to address this. Existing Transformer-based multimodal fusion algorithms suffer from poor fusion performance between modalities and high computational complexity. To improve the interaction between modalities, this paper proposes a new Transformer-based video captioning model, multimodal attention bottleneck for video captioning (MABVC). First, pre-trained I3D and VGGish networks extract the visual and audio features of a video, and the extracted features are fed into a Transformer model. Then the decoder trains on the two modalities separately before performing multimodal fusion. Finally, the decoder output is processed to generate a text caption that people can understand. Comparison experiments on the public datasets MSR-VTT and MSVD and the self-built dataset BUUISE validate the model against standard evaluation metrics. The results show that the video captioning model based on multimodal attention fusion improves clearly on all metrics. The model still achieves good results on traffic-scene datasets and has strong application prospects in the intelligent driving industry.
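The attention-bottleneck idea named in the abstract, in which a small set of shared tokens mediates exchange between the visual and audio streams so that full pairwise cross-modal attention is avoided, can be sketched roughly as below. This is a hedged illustration with made-up dimensions and randomly initialized tokens, not the paper's MABVC implementation; `bottleneck_fusion` and all sizes are assumptions for demonstration only.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V,
    # with the usual max-subtraction for numerical stability.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def bottleneck_fusion(visual, audio, bottleneck):
    # Step 1: the few bottleneck tokens gather information from
    # both modalities (cost scales with the bottleneck size, not
    # with visual_len * audio_len).
    ctx = np.concatenate([visual, audio], axis=0)
    bottleneck = attention(bottleneck, ctx, ctx)
    # Step 2: each modality reads the fused context back through
    # the bottleneck, so cross-modal information flows only via
    # these shared tokens.
    visual = attention(visual, bottleneck, bottleneck)
    audio = attention(audio, bottleneck, bottleneck)
    return visual, audio, bottleneck

rng = np.random.default_rng(0)
d = 8                                # illustrative feature dimension
visual = rng.normal(size=(16, d))    # stand-in for I3D clip features
audio = rng.normal(size=(10, d))     # stand-in for VGGish audio features
fsn = rng.normal(size=(4, d))        # 4 shared bottleneck tokens
v2, a2, b2 = bottleneck_fusion(visual, audio, fsn)
print(v2.shape, a2.shape, b2.shape)  # (16, 8) (10, 8) (4, 8)
```

In a real model this exchange would be stacked over several Transformer layers with learned projections; the sketch only shows why the bottleneck keeps cross-modal cost low while letting both streams share information.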

