
Details

TRANSTL: SPATIAL-TEMPORAL LOCALIZATION TRANSFORMER FOR MULTI-LABEL VIDEO CLASSIFICATION (Indexed by CPCI-S and EI)

Document Type: Conference paper

English Title: TRANSTL: SPATIAL-TEMPORAL LOCALIZATION TRANSFORMER FOR MULTI-LABEL VIDEO CLASSIFICATION

Authors: Wu, Hongjun[1]; Li, Mengzhu[1]; Liu, Yongcheng[2]; Liu, Hongzhe[1]; Xu, Cheng[1]; Li, Xuewei[1]

First Author: Wu, Hongjun

Corresponding Author: Liu, Hongzhe[1]

Affiliations: [1] Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China; [2] Institute of Automation, Chinese Academy of Sciences, Beijing, China

First Affiliation: Beijing Key Laboratory of Information Service Engineering, Beijing Union University

Corresponding Affiliation: [1] Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China

Year: 2022

Pages: 1965-1969

Source: 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)

Indexed by: EI (Accession No. 20222912369200); Scopus (Accession No. 2-s2.0-85134050288); WOS CPCI-S (Accession No. WOS:000864187902048)

Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 61871039 and 62171042) and the Academic Research Projects of Beijing Union University (Nos. ZB10202003, ZK40202101, ZK120202104). *Corresponding author: liuhongzhe@buu.edu.cn

Language: English

Keywords: Multi-label Video Classification; Label Co-occurrence Dependency; Spatial Temporal Label Dependency; Transformer

Abstract: Multi-label video classification (MLVC) is a long-standing and challenging research problem in video signal analysis. Real-world videos generally contain many complex action labels, and these actions exhibit inherent dependencies in both the spatial and temporal domains. Motivated by this observation, we propose TranSTL, a spatial-temporal localization Transformer framework for the MLVC task. In addition to leveraging global action label co-occurrence, we propose a novel plug-and-play Spatial Temporal Label Dependency (STLD) layer in TranSTL. STLD not only dynamically models label co-occurrence within a video via a self-attention mechanism, but also fully captures spatial-temporal label dependencies using a cross-attention strategy. As a result, TranSTL can explicitly and accurately grasp the diverse action labels in both the spatial and temporal domains. Extensive evaluation and empirical analysis show that TranSTL outperforms state-of-the-art methods on two challenging benchmarks, Charades and MultiTHUMOS.
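The abstract describes STLD as self-attention over action labels (to model co-occurrence) combined with cross-attention from labels to spatial-temporal video features (to model spatial-temporal label dependency). The following is a minimal sketch of that mechanism in PyTorch; it is not the authors' implementation, and all module names, dimensions, and the classification head are hypothetical choices for illustration.

```python
# Minimal sketch (assumption: not the authors' code) of an STLD-style layer:
# self-attention over label embeddings models label co-occurrence, and
# cross-attention from labels to spatial-temporal video tokens captures
# spatial-temporal label dependencies. Names and dimensions are illustrative.
import torch
import torch.nn as nn


class STLDLayerSketch(nn.Module):
    def __init__(self, num_labels: int, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Learnable query embedding per action label.
        self.label_embed = nn.Parameter(torch.randn(num_labels, dim) * 0.02)
        # Self-attention among labels -> dynamic label co-occurrence.
        self.label_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention: labels attend to spatial-temporal video tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One binary logit per label (multi-label classification).
        self.classifier = nn.Linear(dim, 1)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T*H*W, dim) flattened spatial-temporal features
        # from any video backbone (assumed already projected to `dim`).
        b = video_tokens.size(0)
        q = self.label_embed.unsqueeze(0).expand(b, -1, -1)    # (B, L, dim)
        q = self.norm1(q + self.label_self_attn(q, q, q)[0])   # label co-occurrence
        q = self.norm2(q + self.cross_attn(q, video_tokens, video_tokens)[0])
        return self.classifier(q).squeeze(-1)                  # (B, L) logits


if __name__ == "__main__":
    layer = STLDLayerSketch(num_labels=157)        # e.g. Charades has 157 classes
    feats = torch.randn(2, 8 * 7 * 7, 256)         # dummy backbone tokens
    print(layer(feats).shape)                      # torch.Size([2, 157])
```

Consuming only flattened backbone tokens keeps the layer plug-and-play, matching the abstract's description, though the actual TranSTL design may differ in its feature layout and attention ordering.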

