Document AI

Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents. Understanding business documents is very challenging because of the diversity of layouts and formats, the poor quality of scanned document images, and the complexity of template structures.

Since 2019, we have released two benchmark datasets, TableBank and DocBank, which are used for table detection and recognition and for page object detection in documents. More recently, we released two new benchmark datasets: ReadingBank, for the reading order detection task, and XFUND, for the multilingual form understanding task, which contains forms in 7 languages.

In addition to the benchmark datasets, we also created the multi-modal pre-training framework LayoutLM for Document AI, together with the latest LayoutLMv2 and the multilingual version LayoutXLM, which have been widely adopted by first- and third-party products and applications in Azure AI, such as Form Recognizer. The LayoutLM/LayoutXLM model family has been applied to a wide range of Document AI applications, including table detection, page object detection, LayoutReader for reading order detection, form/receipt/invoice understanding, complex document understanding, document image classification, and document VQA, while achieving state-of-the-art performance across these benchmarks.

LayoutLM, LayoutLMv2 and LayoutXLM are now part of HuggingFace!
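
As a rough illustration of what preparing input for these models can look like: the Hugging Face implementations of the LayoutLM family take each word together with its bounding box, with coordinates scaled to a 0-1000 range relative to the page size. The sketch below shows only that scaling step. The helper name `normalize_bbox`, the sample words, and the assumption that an upstream OCR step yields pixel-space boxes as `(x0, y0, x1, y1)` are all illustrative and not taken from this project.

```python
from typing import List, Sequence


def normalize_bbox(bbox: Sequence[int], page_width: int, page_height: int) -> List[int]:
    """Scale a pixel-space box (x0, y0, x1, y1) to the 0-1000 range.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    x0, y0, x1, y1 = bbox
    # Integer arithmetic keeps the result free of floating-point rounding.
    return [
        1000 * x0 // page_width,
        1000 * y0 // page_height,
        1000 * x1 // page_width,
        1000 * y1 // page_height,
    ]


# Made-up OCR output for a 1000x2000-pixel page: (word, pixel box).
words = [("Invoice", (50, 100, 200, 300)), ("Total", (500, 1800, 620, 1900))]
boxes = [normalize_bbox(box, 1000, 2000) for _, box in words]
```

The resulting `boxes` list, aligned one-to-one with the words, is the kind of layout input that would then be passed to a LayoutLM tokenizer or processor alongside the text.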

We recently presented our latest OCR research, TrOCR, a Transformer-based OCR model that combines a pre-trained image Transformer with a pre-trained text Transformer. TrOCR is convolution-free and can be easily adapted for multilingual text recognition as well as cloud/edge deployment.

Related products:

Form Recognizer