Record Details
Multi-view Isolated sign language recognition based on cross-view and multi-level transformer (indexed in SCI-EXPANDED and EI)
Document Type: Journal Article
Title (English): Multi-view Isolated sign language recognition based on cross-view and multi-level transformer
Authors: Guan, Zhong[1,2]; Hu, Yongli[1]; Jiang, Huajie[1]; Sun, Yanfeng[1]; Yin, Baocai[1]
First Author: Guan, Zhong (关忠)
Corresponding Author: Hu, YL[1]
Affiliations: [1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, 100 Pingleyuan, Beijing 100124, Peoples R China; [2] Beijing Union Univ, Special Educ Coll, 97 Beisihuan East Rd, Beijing 100101, Peoples R China
First Affiliation: Beijing Univ Technol, Beijing Inst Artificial Intelligence, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, 100 Pingleyuan, Beijing 100124, Peoples R China
Corresponding Affiliation: [1] (corresponding author) Beijing Univ Technol, Beijing Inst Artificial Intelligence, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, 100 Pingleyuan, Beijing 100124, Peoples R China
Year: 2025
Volume: 31
Issue: 3
Journal: MULTIMEDIA SYSTEMS
Indexed in: EI (Accession No. 20251918366700); Scopus (Accession No. 2-s2.0-105004224266); SCI-EXPANDED (Accession No. WOS:001479674600001)
Language: English
Keywords: Isolated sign language recognition; Multi-view recognition; Multi-level transformer; Multi-view sign language dataset
Abstract: Sign language serves as a critical communication medium for the deaf community, yet existing single-view recognition systems are limited in interpreting complex three-dimensional manual movements from monocular video sequences. Although multi-view analysis holds potential for improved spatial understanding, current methods lack effective mechanisms for cross-view feature correlation and adaptive multi-stream fusion. To address these challenges, we propose the Cross-view and Multi-level Transformer (CMTformer), a novel framework for isolated sign language recognition that hierarchically models spatiotemporal dependencies across viewpoints. The architecture integrates transformer-based modules to simultaneously capture dense cross-view correlations and distill high-level semantic relationships through multi-scale feature abstraction. Complementing this methodological advancement, we establish the Multi-View Chinese Sign Language (MVCSL) dataset under real-world conditions, addressing the critical shortage of multi-view benchmarking resources. Experimental evaluations demonstrate that CMTformer significantly outperforms conventional approaches in recognition robustness, particularly in processing intricate gesture dynamics through coordinated multi-view analysis. This study advances sign language recognition via interpretable cross-view modeling while providing an essential dataset for developing viewpoint-agnostic gesture understanding systems.
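Illustration: the record does not include the paper's code, so the sketch below is only a minimal, generic example of the cross-view attention idea the abstract describes (one view's frame features attending to another view's), not the authors' CMTformer implementation. The class name CrossViewAttention and the parameters dim and heads are hypothetical choices for this sketch.

import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    # Hypothetical sketch: enrich one view's per-frame features with cues
    # from a second view via multi-head cross-attention plus a residual.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_a, view_b):
        # view_a, view_b: (batch, frames, dim) frame-level features;
        # queries come from view_a, keys/values from view_b.
        fused, _ = self.attn(query=view_a, key=view_b, value=view_b)
        return self.norm(view_a + fused)

# Example: two synchronized 16-frame clips with 512-d features per frame.
a, b = torch.randn(2, 16, 512), torch.randn(2, 16, 512)
out = CrossViewAttention()(a, b)  # -> shape (2, 16, 512)

Running the module in both directions (a attending to b, and b attending to a) and fusing the results would give a symmetric two-stream variant; how CMTformer actually correlates and fuses views is specified in the paper itself.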