Details

VSLayout: Visual-Semantic Representation Learning for Document Layout Analysis (EI indexed)

Document type: Conference paper

English title: VSLayout: Visual-Semantic Representation Learning for Document Layout Analysis

Authors: Wang, Shangrong[1]; Jiang, Jing[2]; Jiang, Yanjun[1]; Zhang, Xuesong[1]

First author: Wang, Shangrong

Affiliations: [1] School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China; [2] Department of Communication Engineering, Beijing Union University, Beijing, China

First affiliation: School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China

Corresponding affiliation: [1] School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China

Conference proceedings: ACAI 2022 - Conference Proceedings: 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence

Conference dates: December 23, 2022 - December 25, 2022

Conference location: Sanya, China

Language: English

Keywords: Benchmarking - Embeddings - Modal analysis - Optical character recognition

Abstract: Document layout analysis (DLA), which aims to extract and classify the structural regions of a document, is a challenging and critical step for many downstream document understanding tasks. Although the fusion of text (semantic) and image (visual) features has shown significant advantages for DLA, existing methods either require simultaneous text-image pair inputs, which is not applicable when only document images are available, or must resort to optical character recognition (OCR) preprocessing. This paper learns a visual-semantic representation for DLA from the image modality of documents alone, which greatly extends the applicability of DLA to practical settings. Our method consists of three phases. First, we train a text feature extractor (TFE) for document images via cross-modal supervision that enforces coherence between the outputs of the TFE and the text embedding map generated by Sent2Vec. The pretrained TFE is then further adapted using only document images, extracting shallow semantic features that are fed into the third stage. Finally, a two-stream network extracts the deep semantic and visual features, and their fusion is used as the input to a detector module, e.g., an RPN (Region Proposal Network), to generate the final results. On benchmark datasets, we demonstrate that the proposed TFE model outperforms mainstream semantic embedding counterparts and that our approach achieves superior DLA performance over baseline methods. © 2022 ACM.
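The three-phase pipeline described in the abstract (cross-modal TFE pretraining against Sent2Vec embedding maps, image-only adaptation, then two-stream fusion feeding a detector) can be pictured with a minimal sketch. This record contains no code from the paper, so every module, shape, and loss below is an assumption chosen only to illustrate the general idea, not VSLayout's actual implementation:

```python
# Illustrative sketch only: all architectures, dimensions, and the MSE
# "coherence" loss are assumptions; the VSLayout paper may differ entirely.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextFeatureExtractor(nn.Module):
    """Hypothetical TFE: a small CNN mapping a page image to a dense
    semantic feature map, trained to agree with a Sent2Vec embedding map."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, embed_dim, 3, stride=2, padding=1),
        )

    def forward(self, image):
        return self.net(image)  # (B, embed_dim, H/4, W/4)

def coherence_loss(tfe_map, sent2vec_map):
    """Phase-1 cross-modal supervision (assumed form): penalize disagreement
    between TFE output and a precomputed text-embedding map."""
    return F.mse_loss(tfe_map, sent2vec_map)

class TwoStreamFusion(nn.Module):
    """Phase-3 sketch: deep visual and semantic streams fused by
    concatenation + 1x1 conv; the fused map would feed a detector (e.g. RPN)."""
    def __init__(self, embed_dim=64, out_ch=256):
        super().__init__()
        self.visual = nn.Sequential(
            nn.Conv2d(3, out_ch, 3, stride=4, padding=1), nn.ReLU(inplace=True))
        self.semantic = nn.Sequential(
            nn.Conv2d(embed_dim, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, image, sem_feats):
        v = self.visual(image)        # deep visual features, (B, out_ch, H/4, W/4)
        s = self.semantic(sem_feats)  # deep semantic features, same spatial size
        return self.fuse(torch.cat([v, s], dim=1))

if __name__ == "__main__":
    page = torch.randn(2, 3, 256, 256)      # batch of document page images
    tfe = TextFeatureExtractor()
    sem = tfe(page)                          # shallow semantic features (2, 64, 64, 64)
    target = torch.randn_like(sem)           # stand-in for a Sent2Vec embedding map
    print("phase-1 loss:", coherence_loss(sem, target).item())
    fused = TwoStreamFusion()(page, sem)
    print("fused features:", fused.shape)    # torch.Size([2, 256, 64, 64])
```

Concatenation followed by a 1x1 convolution is the simplest possible fusion; the paper may well use a more elaborate scheme (e.g., attention or gating), which the abstract does not specify.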

