Transferring General Multimodal Pretrained Models to Text Recognition

Junyang Lin, Xuancheng Ren, Yichang Zhang, Gao Liu, Peng Wang, An Yang, Chang Zhou

Findings: Language Grounding to Vision, Robotics, and Beyond Findings Paper

Session 4: Language Grounding to Vision, Robotics, and Beyond (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 11, Session 4 (15:00-16:30 UTC)
Keywords: cross-modal content generation, cross-modal application
Languages: chinese
TLDR: This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synt...
You can open the #paper-P1593 channel in a separate window.
Abstract: This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance in the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API.