Detailed Information
TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip (indexed in SCI-EXPANDED)
Document type: Journal article
Title (English): TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip
Authors: Zhang, Hao[1,2]; Xu, Cheng[1,2]; Xu, Bingxin[1,2]; Jian, Muwei[3]; Liu, Hongzhe[1,2]; Li, Xuewei[1,2]
Corresponding authors: Xu, C[1,2]; Xu, BX[1,2]
Affiliations: [1] Beijing Union Univ, Beijing Key Lab Informat Serv Engn, Beijing, Peoples R China; [2] Beijing Union Univ, Inst Brain & Cognit Sci, Coll Robot, Beijing, Peoples R China; [3] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Jinan, Peoples R China
First affiliation: Beijing Key Laboratory of Information Service Engineering, Beijing Union University
Corresponding affiliations: [1] Beijing Union Univ, Beijing Key Lab Informat Serv Engn, Beijing, Peoples R China; [2] Beijing Union Univ, Inst Brain & Cognit Sci, Coll Robot, Beijing, Peoples R China
Year: 2024
Volume: 53
Issue: 1
Pages: 98-114
Journal: INFORMATION TECHNOLOGY AND CONTROL
Indexed in: Scopus (Accession No. 2-s2.0-85188968177); SCI-EXPANDED (Accession No. WOS:001280512700006)
Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 62006020, 62102033, 62171042), the R&D Program of the Beijing Municipal Education Commission (Grant No. KZ202211417048), the Project of Construction and Support for High-Level Innovative Teams of Beijing Municipal Institutions (Grant No. BPHR20220121), the Beijing Natural Science Foundation (Grant No. 4232026), and the Academic Research Projects of Beijing Union University (No. ZKZD202302).
Language: English
Keywords: Contrastive learning; Deep learning; Image captioning; Traffic scene; Transformer
Abstract: Image captioning in traffic scenes presents several challenges, including imprecise caption generation, a lack of personalization, and an unwieldy number of model parameters. We propose a new image captioning model for traffic scenes to address these issues. The model incorporates an adapter-based fine-tuned feature extraction part to enhance personalization and a caption generation module using global weighted attention pooling to reduce model parameters and improve accuracy. The proposed model consists of four main stages. In the first stage, the Image-Encoder extracts the global features of the input image, divides the image into nine sub-regions, and encodes each sub-region separately. In the second stage, the Text-Encoder encodes the text dataset to obtain text features; the model then computes the similarity between the image sub-region features and the encoded text features and selects the text features with the highest similarity. In the third stage, the pre-trained Faster R-CNN model extracts local image features, and the model concatenates the text features, global image features, and local image features to fuse the multimodal information. In the final stage, the extracted features are fed into the Captioning model, which fuses the different features using a novel global weighted attention pooling layer and generates natural-language image captions. The proposed model is evaluated on the MS-COCO, Flickr30K, and BUUISE-Image datasets using mainstream evaluation metrics. Experiments demonstrate significant improvements across all evaluation metrics on the public datasets and strong performance on the BUUISE-Image traffic scene dataset.
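To make two mechanisms named in the abstract more concrete, the short PyTorch sketch below illustrates how per-sub-region text selection by cosine similarity and a global weighted attention pooling layer could look. This is not the authors' released code; the function and class names, tensor shapes, and the exact pooling formulation are illustrative assumptions, not the paper's verified design.

    import torch
    import torch.nn.functional as F

    def select_text_features(region_feats, text_feats):
        # region_feats: (9, d) sub-region embeddings; text_feats: (N, d) encoded text features.
        # Returns, for each sub-region, the text feature with the highest cosine similarity.
        region_feats = F.normalize(region_feats, dim=-1)
        text_feats = F.normalize(text_feats, dim=-1)
        sim = region_feats @ text_feats.t()        # (9, N) cosine-similarity matrix
        best = sim.argmax(dim=-1)                  # best-matching text index per sub-region
        return text_feats[best]                    # (9, d) selected text features

    class GlobalWeightedAttentionPooling(torch.nn.Module):
        # Assumed form: learn one scalar score per token, softmax over the sequence,
        # and pool the fused multimodal tokens into a single weighted-sum vector.
        def __init__(self, dim):
            super().__init__()
            self.score = torch.nn.Linear(dim, 1)

        def forward(self, tokens):                 # tokens: (batch, seq_len, dim)
            weights = torch.softmax(self.score(tokens), dim=1)  # (batch, seq_len, 1)
            return (weights * tokens).sum(dim=1)   # (batch, dim) pooled feature

Under these assumptions, GlobalWeightedAttentionPooling(dim=512) applied to a fused (batch, seq_len, 512) token sequence would yield a single (batch, 512) vector for the caption decoder.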