Details
CBHA-DETR: multi-kernel attention and deformable fusion network for behavior recognition in classroom monitoring (indexed in SCI-EXPANDED and EI)
Document type: Journal article
English title: CBHA-DETR: multi-kernel attention and deformable fusion network for behavior recognition in classroom monitoring
Authors: Li, Tianci[1,2]; Wang, Jin[1,2]; Xu, Cheng[1,2]; Xu, Bingxin[1,2]; An, Ning[3]; Zhang, Jiancheng[1,2]
First author: Li, Tianci
Corresponding author: Zhang, JC[1]; Zhang, JC[2]
Affiliations: [1] Beijing Union Univ, Beijing Key Lab Informat Serv Engn, 97 Beisihuan East Rd, Beijing 100101, Peoples R China; [2] Beijing Union Univ, Coll Robot, 97 Beisihuan East Rd, Beijing 100101, Peoples R China; [3] Beijing Union Univ, Div Acad Affairs, 97 Beisihuan East Rd, Beijing 100101, Peoples R China
First affiliation: Beijing Key Laboratory of Information Service Engineering, Beijing Union University
Corresponding affiliation: [1] (corresponding author) Beijing Union Univ, Beijing Key Lab Informat Serv Engn, 97 Beisihuan East Rd, Beijing 100101, Peoples R China; [2] (corresponding author) Beijing Union Univ, Coll Robot, 97 Beisihuan East Rd, Beijing 100101, Peoples R China
Year: 2026
Volume: 32
Issue: 2
Journal: MULTIMEDIA SYSTEMS
Indexed in: EI (accession no. 20260620030582); SCI-EXPANDED (accession no. WOS:001680937300010)
Funding: This work was supported by the Key Project of the National Language Commission (ZDI145-110), the Beijing Higher Education Teaching Reform Project (202411417002), the Ministry of Education Project (25YJC740057), the Key Laboratory Project (YYZN-2024-6), the Beijing Municipal Education Working Committee Project (XXSZ2024GZ17), and the Project of Beijing Municipal Institutions (BPHR20220121).
Language: English
Keywords: Classroom behavior recognition; DETR; Deformable convolution; Vision transformer
Abstract: Classroom behavior recognition in complex educational environments poses significant challenges due to occlusions, multi-scale interactions, and fine-grained behavior distinctions. To address these limitations, we propose CBHA-DETR, a novel multi-kernel attention and deformable fusion network optimized for real-time behavior detection in classroom monitoring scenarios. The framework integrates a hybrid architecture featuring a Progressive Kernel Inception Block for Classroom (PKICBlock) and a Monte Carlo Attention Block (MoCABlock) within its deep backbone network layer. These components synergistically enable multi-scale feature extraction through progressive parallel depthwise separable convolutions and stochastic attention mechanisms, effectively capturing spatial-contextual relationships from local movements to full-body postures. We apply the Transformer encoder exclusively to the final layer of the backbone network to significantly reduce parameter complexity. A deformable cross-scale fusion neck (DCFusion) adaptively aligns multi-scale features via deformable convolution and content-aware upsampling (CARAFE), significantly improving geometric adaptability to complex postures. The predictions from the Transformer decoder are optimized using the Normalized Wasserstein Distance (NWD) and shape constraint metrics, enhancing geometric perception of asymmetric postures. Extensive experiments on SCB-Dataset3 demonstrate competitive performance. Specifically, our model achieves 73.2% precision and 70.9% recall with 73.4% mAP50, surpassing RT-DETR-R34 by 1.0% in precision, 1.9% in recall, and 2.0% in mAP50, while reducing parameters by 38.7% (from 31.3M to 19.2M) and maintaining real-time inference at 40.8 FPS, providing a practical and scalable solution for deployment in intelligent education systems.
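The abstract does not spell out the NWD formulation used; the sketch below follows the commonly cited Normalized Gaussian Wasserstein Distance for bounding boxes, where each box (cx, cy, w, h) is modeled as a 2D Gaussian with mean (cx, cy) and covariance diag((w/2)², (h/2)²). The normalization constant `C` is dataset-dependent; the value 12.8 here is a placeholder assumption, not a value taken from the paper.

```python
import math

def normalized_wasserstein_distance(box1, box2, c=12.8):
    """NWD similarity between two boxes given as (cx, cy, w, h).

    The squared 2-Wasserstein distance between the two Gaussians
    modeling the boxes reduces to a closed form: the squared center
    offset plus the squared half-extent differences. NWD maps that
    distance into (0, 1], with 1.0 meaning identical boxes.
    Note: c=12.8 is an illustrative constant, not from the paper.
    """
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    w2_squared = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
                  + ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2)
    return math.exp(-math.sqrt(w2_squared) / c)
```

Unlike IoU, this similarity stays smooth and non-zero for non-overlapping boxes, which is why NWD-style metrics are often preferred for small or thin targets such as partially occluded students.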