FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

Chen-Yu Lee; Chun-Liang Li; Hao Zhang; Timothy Dozat; Vincent Perot; Guolong Su; Xiang Zhang; Kihyuk Sohn; Nikolay Glushnev; Renshen Wang; Joshua Ainslie; Shangbang Long; Siyang Qin; Yasuhisa Fujii; Nan Hua; Tomas Pfister

Main: Information Extraction Main-poster Paper

Poster Session 3: Information Extraction (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)

Global Time: July 11, Poster Session 3 (13:00-14:30 UTC)

Keywords: named entity recognition and relation extraction

Abstract: The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
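
The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `edge_union_box` and `contrastive_loss`, the `(x0, y0, x1, y1)` box layout, the InfoNCE-style form of the loss, and the temperature value are all assumptions chosen for illustration. The abstract states only that image features come from the box joining two edge-connected tokens and that a contrastive objective maximizes agreement between multimodal representations.

```python
import math


def edge_union_box(box_a, box_b):
    """Smallest axis-aligned box covering both token boxes.

    Boxes are assumed to be (x0, y0, x1, y1) tuples; this layout is an
    illustrative assumption, not taken from the paper.
    """
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def _cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(view1, view2, temperature=0.5):
    """InfoNCE-style loss between two views of the same graph (assumed form).

    view1[i] and view2[i] are representations of the same node under two
    different corruptions; every other node in view2 acts as a negative.
    """
    n = len(view1)
    total = 0.0
    for i in range(n):
        logits = [_cosine(view1[i], view2[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in view2.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching node.
        total += log_denominator - logits[i]
    return total / n


# Two neighbouring tokens on the same text line of a form.
region = edge_union_box((10, 20, 50, 40), (60, 22, 120, 42))

# Matching views should score a lower loss than mismatched ones.
nodes = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(nodes, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(nodes, [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch `region` is the image crop an edge would look at, and `aligned` comes out smaller than `swapped` because each node agrees with its own counterpart in the second view. A single loss of this kind over fused node representations is one way to read the abstract's "one loss" for all modalities.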