CanvasEmb: Learning Layout Representation with Large-scale Pre-training for Graphic Design

Layout representation, which models visual elements in a canvas and their inter-relations, plays a crucial role in graphic design intelligence. Given the large variety of layout designs and the unique characteristic of layouts, namely that each visual element is defined by a list of categorical (e.g., type) and numerical (e.g., position and size) properties, it is challenging to learn general and compact representations from limited data. Inspired by the recent success of self-supervised pre-training in natural language processing, we propose CanvasEmb (Canvas Embedding), which pre-trains deep representations from unlabeled graphic designs by jointly conditioning on all context elements in a canvas, using a multi-dimensional feature encoder and a multi-task learning objective. The pre-trained CanvasEmb model can be fine-tuned with just one additional output layer and a small amount of training data to create models for a wide range of downstream tasks. We verify our approach on presentation slides: we construct a large-scale dataset of more than one million slides and propose two layout understanding tasks with human-labeled sets, namely element role labeling and image captioning. Evaluation results on these two tasks show that our fine-tuned model achieves state-of-the-art performance. Furthermore, we conduct an in-depth analysis of the modeling mechanism of CanvasEmb and demonstrate its potential with two extended applications: layout auto-completion and layout retrieval.
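To make the input representation concrete, the following is a minimal sketch of how a multi-dimensional feature encoder might fuse each element's categorical and numerical properties into a single vector before contextual encoding. All names, dimensions, and the sum-based fusion are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumptions, not from the paper).
NUM_TYPES = 8      # categorical property: element type vocabulary
NUM_NUMERIC = 4    # numerical properties: x, y, width, height
D_MODEL = 16       # shared embedding dimension

# Randomly initialized stand-ins for learnable parameters.
type_embedding = rng.normal(size=(NUM_TYPES, D_MODEL))
numeric_proj = rng.normal(size=(NUM_NUMERIC, D_MODEL))

def encode_element(type_id, numeric_props):
    """Fuse one element's properties into a single vector by summing
    a learned type embedding with a linear projection of its
    normalized geometry (position and size)."""
    return type_embedding[type_id] + np.asarray(numeric_props) @ numeric_proj

# A toy canvas with two elements; geometry is normalized to [0, 1].
canvas = [
    (0, [0.1, 0.1, 0.8, 0.2]),  # e.g., a title placeholder
    (3, [0.1, 0.4, 0.8, 0.5]),  # e.g., a picture
]
tokens = np.stack([encode_element(t, g) for t, g in canvas])
print(tokens.shape)  # (2, 16): one vector per element
```

The resulting per-element vectors would then be fed jointly to a contextual encoder (e.g., a Transformer) so that each element's representation conditions on every other element in the canvas.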