Document AI (Intelligent Document Processing)

Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents. Understanding business documents is challenging because of the diversity of layouts and formats, the poor quality of scanned document images, and the complexity of template structures.

Starting in 2019, we released two benchmark datasets, TableBank and DocBank, which are used for table detection and recognition and for page object detection in documents. More recently, we released two further benchmark datasets: ReadingBank for the reading order detection task, and XFUND for the multilingual form understanding task, which contains forms in seven languages.

In addition to the benchmark datasets, we also proposed the multimodal Document Foundation Model, including the pre-trained LayoutLM model family for Document AI, which has been widely adopted by first-party and third-party products and applications in Azure AI, such as Form Recognizer. The LayoutLM/LayoutXLM model family has been applied to a wide range of Document AI applications, including table detection, page object detection, LayoutReader for reading order detection, form/receipt/invoice understanding, complex document understanding, document image classification, and document VQA, while achieving state-of-the-art performance on the corresponding benchmarks.

Moreover, we proposed MarkupLM to jointly pre-train text and markup language in a single framework for markup-based visually rich document understanding (VrDU) tasks. Unlike fixed-layout documents, markup-based documents cannot straightforwardly supply 2D position information or document image information during pre-training, so they call for a different view of document representation learning. MarkupLM instead takes advantage of the tree-based markup structure to model the relationships among different units within the document.
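As a rough illustration of what a tree-based markup structure offers in place of 2D coordinates, the sketch below walks an HTML string and records an XPath-like expression for every text node. It is a minimal sketch using only the Python standard library, not MarkupLM's own preprocessing: the class name XPathTextExtractor, the helper extract_xpaths, the short VOID_TAGS list, and the choice to index every tag are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that never contain children (a deliberately short, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class XPathTextExtractor(HTMLParser):
    """Collect (xpath, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []            # (tag, sibling index) from the root down
        self.child_counts = [{}]   # per-depth count of each tag seen so far
        self.nodes = []            # collected (xpath, text) pairs

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return  # void elements cannot hold text, so do not descend
        self.stack.append((tag, counts[tag]))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            xpath = "".join(f"/{t}[{i}]" for t, i in self.stack)
            self.nodes.append((xpath, text))


def extract_xpaths(html):
    """Return a list of (xpath, text) pairs in document order."""
    parser = XPathTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


if __name__ == "__main__":
    page = "<html><body><div><p>Invoice</p><p>Total: 42</p></div></body></html>"
    for xpath, text in extract_xpaths(page):
        print(xpath, "->", text)
```

Two text nodes that share a long XPath prefix are close together in the markup tree, which is the kind of relationship a markup-aware model can exploit when pixel positions are unavailable.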

Recently, we presented our latest research on OCR, namely TrOCR, a Transformer-based OCR model with a pre-trained image Transformer and a text Transformer. TrOCR is convolution-free and can be easily adapted for multilingual text recognition as well as cloud/edge deployment.

Image Transformers have recently achieved considerable progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. We propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks. Self-supervision is essential here because no supervised counterparts exist, owing to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, and table detection, where it achieves significant improvements and new state-of-the-art results.

LayoutLMv3 is a multimodal pre-trained Transformer for Document AI with unified text and image masking. It is additionally pre-trained with a word-patch alignment objective, which learns cross-modal alignment by predicting whether the image patch corresponding to a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric (e.g., form/receipt understanding) and image-centric (e.g., document layout analysis, table detection) Document AI tasks.
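The word-patch alignment idea can be made concrete by showing how training labels for such an objective might be built. The following is a minimal sketch under stated assumptions, not the LayoutLMv3 implementation: the function names, the 224x224 image size, the 16-pixel patches, and the rule that a word counts as unaligned when any patch it overlaps is masked are all choices made for this illustration, and the sketch ignores whether the text tokens themselves are masked.

```python
def patches_for_box(box, image_size, patch_size):
    """Return the set of (row, col) patch indices that a pixel box overlaps.

    box is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    # Clamp the box to the image so indices stay inside the patch grid.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        return set()  # box lies entirely outside the image
    cols = range(x0 // patch_size, (x1 - 1) // patch_size + 1)
    rows = range(y0 // patch_size, (y1 - 1) // patch_size + 1)
    return {(r, c) for r in rows for c in cols}


def word_patch_alignment_labels(word_boxes, masked_patches,
                                image_size=(224, 224), patch_size=16):
    """Label each word 1 (aligned) or 0 (unaligned).

    A word is labeled 0 when at least one patch it overlaps is in
    masked_patches; otherwise it is labeled 1. Words that overlap no
    patch at all are labeled 1 in this sketch.
    """
    labels = []
    for box in word_boxes:
        covered = patches_for_box(box, image_size, patch_size)
        labels.append(0 if covered & masked_patches else 1)
    return labels


if __name__ == "__main__":
    boxes = [(0, 0, 30, 10), (40, 0, 60, 10)]   # hypothetical word boxes
    masked = {(0, 2)}                            # hypothetical masked patch
    print(word_patch_alignment_labels(boxes, masked))
```

A binary classifier over the text-token outputs could then be trained against these labels, which forces the text representation to carry information about what is visible in the image.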

XDoc is a unified pre-trained model that handles different document formats in a single model. For parameter efficiency, XDoc shares backbone parameters, such as the word embedding layer and the Transformer layers, across formats. Meanwhile, adaptive layers with lightweight parameters are introduced to sharpen the distinction between formats. Experimental results show that, with only 36.7% of the parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks than the individual pre-trained models, which makes it cost-effective for real-world deployment.
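The arithmetic behind this kind of saving is simple: a shared backbone is counted once, while only the lightweight adaptive layers are repeated per format. The sketch below works through that calculation with made-up sizes. The format names and every parameter count here are hypothetical placeholders chosen for illustration, so the resulting ratio is not the reported 36.7% and should not be read as XDoc's actual configuration.

```python
def unified_parameter_ratio(shared_params, adapter_params, individual_params):
    """Return the unified model size divided by the combined size of the
    separate per-format models.

    shared_params:     parameter count of the backbone shared by all formats
    adapter_params:    dict mapping format name -> adaptive-layer parameters
    individual_params: dict mapping format name -> size of a standalone model
    """
    unified_total = shared_params + sum(adapter_params.values())
    individual_total = sum(individual_params.values())
    return unified_total / individual_total


# Hypothetical sizes, for illustration only (not taken from XDoc).
SHARED = 100_000_000
ADAPTERS = {"plain": 2_000_000, "document": 2_000_000, "web": 2_000_000}
# Assume each standalone model is one backbone plus its own format-specific part.
INDIVIDUAL = {name: SHARED + size for name, size in ADAPTERS.items()}

ratio = unified_parameter_ratio(SHARED, ADAPTERS, INDIVIDUAL)
print(f"unified / individual = {ratio:.1%}")
```

With a backbone that dominates the parameter count, the ratio approaches one divided by the number of formats, which is why sharing the embedding and Transformer layers accounts for most of the saving.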

The LayoutLM model family has become a set of foundation models for Document AI in many first-party and third-party applications. LayoutLM, LayoutLMv2, LayoutXLM, LayoutLMv3, TrOCR, DiT and MarkupLM are now part of HuggingFace!

Contact: Lei Cui, Furu Wei

Project Repository:

Model: https://github.com/microsoft/unilm

Data: https://github.com/doc-analysis

Form Recognizer