Details
VSLayout: Visual-Semantic Representation Learning for Document Layout Analysis (EI indexed)
Document Type: Conference Paper
English Title: VSLayout: Visual-Semantic Representation Learning for Document Layout Analysis
Authors: Wang, Shangrong[1]; Jiang, Jing[2]; Jiang, Yanjun[1]; Zhang, Xuesong[1]
First Author: Wang, Shangrong
Affiliations: [1] School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China; [2] Department of Communication Engineering, Beijing Union University, Beijing, China
First Affiliation: School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
Corresponding Affiliation: [1] School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
Conference Proceedings: ACAI 2022 - Conference Proceedings: 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence
Conference Dates: December 23–25, 2022
Conference Location: Sanya, China
Language: English
Keywords: Benchmarking - Embeddings - Modal analysis - Optical character recognition
Abstract: Document layout analysis (DLA), which aims to extract and classify the structural regions of a document, is a challenging and critical step for many downstream document understanding tasks. Although the fusion of text (semantic) and image (visual) features has shown significant advantages for DLA, existing methods either require simultaneous text-image pair inputs, which is not applicable when only document images are available, or must resort to optical character recognition (OCR) preprocessing. This paper learns a visual-semantic representation for DLA from the imaging modality of documents alone, which greatly extends the applicability of DLA to practical applications. Our method consists of three phases. First, we train a text feature extractor (TFE) for document images via cross-modal supervision that enforces coherence between the outputs of TFE and the text embedding map generated by Sent2Vec. Then the pretrained TFE is further adapted using only the document images and extracts shallow semantic features that are fed into the third stage. Finally, a two-stream network is employed to extract the deep semantic and visual features, and their fusion is used as the input to a detector module, e.g., an RPN (Region Proposal Network), to generate the final results. On benchmark datasets, we demonstrate that the proposed TFE model outperforms mainstream semantic embedding counterparts and that our approach achieves superior DLA performance compared to baseline methods. © 2022 ACM.
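The abstract describes a three-phase pipeline: cross-modally supervised TFE pretraining against Sent2Vec embeddings, image-only adaptation, and a two-stream network whose fused features feed a detector such as an RPN. The sketch below illustrates that structure only; the module shapes, the cosine-coherence objective, and the channel-wise concatenation fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal PyTorch sketch of the three-phase pipeline from the abstract.
# All names, channel sizes, and losses here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFE(nn.Module):
    """Hypothetical text feature extractor (phases 1-2): maps a document
    image to a shallow semantic feature map."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, stride=2, padding=1),
        )
    def forward(self, image):
        return self.net(image)

def coherence_loss(tfe_features, sent2vec_map):
    """Phase-1 cross-modal supervision: align TFE output with a text
    embedding map (e.g., from Sent2Vec). A cosine objective is one
    plausible choice; the paper may use a different loss."""
    return 1.0 - F.cosine_similarity(
        tfe_features.flatten(1), sent2vec_map.flatten(1)
    ).mean()

class TwoStreamDLA(nn.Module):
    """Phase 3: a deep semantic stream over the TFE features and a visual
    stream over the raw image; the fused features would feed a detector
    head such as an RPN (stubbed here as a 1x1 conv)."""
    def __init__(self, tfe):
        super().__init__()
        self.tfe = tfe
        self.semantic_stream = nn.Conv2d(64, 128, 3, padding=1)
        self.visual_stream = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
        )
        self.detector_head = nn.Conv2d(256, 5, 1)  # e.g., 5 region classes
    def forward(self, image):
        sem = self.semantic_stream(self.tfe(image))
        vis = self.visual_stream(image)
        fused = torch.cat([sem, vis], dim=1)  # channel-wise fusion (assumed)
        return self.detector_head(fused)

# Usage: per-location region-class scores for one 256x256 document image.
model = TwoStreamDLA(TFE())
scores = model(torch.randn(1, 3, 256, 256))
```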