Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents. Understanding business documents is a very challenging task due to the diversity of layouts and formats, the poor quality of scanned document images, and the complexity of template structures.
Starting in 2019, we released two benchmark datasets, TableBank and DocBank, for table detection and recognition as well as page object detection in documents. More recently, we released two new benchmark datasets: ReadingBank for the reading order detection task, and XFUND for the multilingual form understanding task, which contains forms in seven languages.
In addition to the benchmark datasets, we also created the multi-modal pre-training framework LayoutLM for Document AI, together with the latest LayoutLMv2 and the multilingual version LayoutXLM, which have been widely adopted by 1st-party and 3rd-party products and applications in Azure AI, such as Form Recognizer. The LayoutLM/LayoutXLM model family has been applied to a wide range of Document AI applications, including table detection, page object detection, reading order detection (LayoutReader), form/receipt/invoice understanding, complex document understanding, document image classification, and document VQA, achieving state-of-the-art performance across these benchmarks.
We also proposed MarkupLM, which jointly pre-trains text and markup language in a single framework for markup-based visually rich document understanding (VrDU) tasks. Unlike fixed-layout documents, markup-based documents offer a different viewpoint for document representation learning, because 2D position information and document image information cannot be used straightforwardly during pre-training. Instead, MarkupLM takes advantage of the tree-based markup structure to model the relationships among different units within the document.
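As an illustration of the markup structure signal, the sketch below derives an XPath-like path for each text node in an HTML fragment using only the standard library. This is a simplification, not MarkupLM's actual implementation (which additionally encodes tag subscripts and embeds the paths as learned features):

```python
# Minimal sketch: recover an XPath-like path for each text node in an
# HTML tree, the kind of structural signal MarkupLM pairs with text.
from html.parser import HTMLParser

class XPathExtractor(HTMLParser):
    """Tracks the open-tag stack and records a path for each text node."""
    def __init__(self):
        super().__init__()
        self.stack = []          # open tags from root to the current node
        self.text_xpaths = []    # (text, path) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_xpaths.append((text, "/" + "/".join(self.stack)))

parser = XPathExtractor()
parser.feed("<html><body><div><span>Total</span></div><p>42 USD</p></body></html>")
print(parser.text_xpaths)
# [('Total', '/html/body/div/span'), ('42 USD', '/html/body/p')]
```

Each text unit thus carries a path that encodes where it sits in the tree, which is what lets the model relate units structurally without any 2D layout coordinates.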
Recently, we presented our latest research on OCR, namely TrOCR, a Transformer-based OCR model that combines a pre-trained image Transformer encoder with a pre-trained text Transformer decoder. TrOCR is convolution-free and can be easily adapted for multilingual text recognition as well as cloud/edge deployment.
Image Transformers have recently made considerable progress on natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. We propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled document images for Document AI tasks; self-supervision is essential here because no supervised counterpart exists, given the lack of large-scale human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, and table detection, achieving significant improvements and new state-of-the-art results.
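To make the self-supervised objective concrete, here is a toy sketch of BEiT-style masked image modeling, the kind of pre-training DiT builds on: a fraction of image patches is masked, and the model is trained to predict discrete visual tokens at the masked positions. The patch grid, mask ratio, and random "visual tokens" below are stand-ins, not DiT's actual components:

```python
# Toy sketch of masked image modeling for document images.
import numpy as np

rng = np.random.default_rng(0)

num_patches = 14 * 14          # a 224x224 image split into 16x16 patches
mask_ratio = 0.4               # fraction of patches hidden from the encoder

# Choose which patches to mask.
mask = np.zeros(num_patches, dtype=bool)
masked_idx = rng.choice(num_patches, size=int(num_patches * mask_ratio), replace=False)
mask[masked_idx] = True

# Stand-in "visual tokens" from an image tokenizer (e.g. a dVAE codebook).
vocab_size = 8192
visual_tokens = rng.integers(0, vocab_size, size=num_patches)

# The pre-training loss is cross-entropy over the masked positions only:
targets = visual_tokens[mask]
print(mask.sum(), "of", num_patches, "patches masked;", len(targets), "prediction targets")
```

Because the targets come from the image itself, no human labels are needed, which is exactly why this recipe suits document images where labeled data is scarce.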
We further proposed LayoutLMv3, a multimodal pre-trained Transformer for Document AI with unified text and image masking. It is additionally pre-trained with a word-patch alignment objective that learns cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric (e.g., form/receipt understanding) and image-centric (e.g., document layout analysis, table detection) Document AI tasks.
The LayoutLM model family has become a document foundation model for many 1st-party and 3rd-party applications. Meanwhile, LayoutLM, LayoutLMv2, LayoutXLM, TrOCR, DiT, and LayoutLMv3 are now available in Hugging Face Transformers!
Contact: Lei Cui, Furu Wei