Details

StreamCMT: Prior-Guided Multimodal Temporal Fusion for Sparse 3D Object Detection (indexed in SCI-EXPANDED and EI)

Document type: Journal article

English title: StreamCMT: Prior-Guided Multimodal Temporal Fusion for Sparse 3D Object Detection

Authors: Huang, Yanliang[1]; Liu, Yuansheng[1,2]

First author: Huang, Yanliang

Corresponding author: Liu, YS[1]

Affiliations: [1] Beijing Union Univ, Coll Robot, Beijing 100101, Peoples R China; [2] Beijing Union Univ, Dept Elect Engn, Beijing 100101, Peoples R China

First affiliation: College of Robotics, Beijing Union University

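Corresponding affiliation: [1] (corresponding author), Beijing Union Univ, Dept Elect Engn, Beijing 100101, Peoples R China. | [11417109] Department of Electronic Engineering, Smart City College, Beijing Union University; [11417] Beijing Union University; [1141734] Smart City College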

Year: 2026

Volume: 11

Issue: 5

Pages: 5358-5365

Journal: IEEE ROBOTICS AND AUTOMATION LETTERS

Indexed in: EI (accession number: 20261120275218); WOS: SCI-EXPANDED (accession number: WOS:001719505900015)

Funding: This work was supported by the National Natural Science Foundation of China (NSFC), under Grant 62371013.

Language: English

Keywords: Three-dimensional displays; Feature extraction; Encoding; Accuracy; Image coding; Object detection; Cameras; Multilayer perceptrons; Laser radar; Point cloud compression; Autonomous vehicle navigation; sensor fusion; deep learning for visual perception

Abstract: Multimodal 3D detection is critical for autonomous driving reliability. While most existing methods boost accuracy via elaborate networks, they neglect inference speed, which is essential for real-world deployment. Although existing decoder-based sparse query detection methods offer advantages in real-time performance, they suffer from limitations in convergence speed and cross-modal feature integration. To address these challenges of slow convergence and inadequate feature fusion, this letter proposes a Prior-Guided Position Embedding Module based on the Cross Modal Transformer (CMT) framework. The module reconstructs the 3D sampling point distribution through spatial geometric priors, effectively improving model accuracy and accelerating convergence without incurring additional computational overhead. Concurrently, to enhance motion awareness, we integrate a Temporal Fusion Module that leverages historical frame information to optimize current detection performance. Experimental results demonstrate that StreamCMT achieves a detection accuracy of 72.5% NDS and 69.6% mAP on the nuScenes test set. On the validation set, compared to the baseline model, it improves NDS and mAP by 1.0% and 1.1%, respectively, while increasing inference speed from 12.0 to 14.4 FPS. The model maintains a lightweight architecture while achieving an effective trade-off between detection accuracy and inference efficiency for autonomous driving perception systems.

Beijing Union University Institutional Repository
