Video Summarization by Learning Deep Side Semantic Embedding

Yitian Yuan, Tao Mei, Peng Cui, Wenwu Zhu

IEEE Transactions on Circuits and Systems for Video Technology

With the rapid growth of video content, video summarization, which focuses on automatically selecting important and informative parts from videos, is becoming increasingly crucial. However, the problem is challenging due to its subjective nature. Previous research, which predominantly relies on manually designed criteria or costly human annotations, often fails to achieve satisfactory results. We observe that the side information associated with a video (e.g., surrounding text such as titles, queries, descriptions, comments, and so on) represents a kind of human-curated semantics of video content. This side information, although valuable for video summarization, is overlooked in existing approaches. In this paper, we present a novel Deep Side Semantic Embedding (DSSE) model to generate video summaries by leveraging the freely available side information. DSSE constructs a latent subspace by correlating the hidden layers of two uni-modal autoencoders, which embed the video frames and the side information, respectively. Specifically, by interactively minimizing the semantic relevance loss and the feature reconstruction loss of the two uni-modal autoencoders, the comparable common information between video frames and side information can be more completely learned, and their semantic relevance can therefore be more effectively measured. Finally, semantically meaningful segments are selected from videos by minimizing their distances to the side information in the constructed latent subspace. We conduct experiments on two datasets (Thumb1K and TVSum50) and demonstrate the superior performance of DSSE over several state-of-the-art video summarization approaches.
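The abstract's core idea can be sketched in a few lines of NumPy. The sketch below is a minimal, illustrative stand-in, not the paper's actual architecture or training procedure: dimensions, activations, and loss weighting are assumptions, and the weights are random rather than trained. It shows the two uni-modal autoencoders, the combined objective (feature reconstruction loss plus semantic relevance loss between paired latent codes), and the final step of ranking frames by latent-space distance to the side information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not the paper's settings):
# video-frame feature, side-information feature, shared latent subspace.
d_v, d_s, d_z = 8, 6, 4

def make_autoencoder(d_in, d_z, rng):
    """One-hidden-layer autoencoder: encoder weights W_e, decoder weights W_d."""
    return {"W_e": rng.standard_normal((d_z, d_in)) * 0.1,
            "W_d": rng.standard_normal((d_in, d_z)) * 0.1}

def encode(ae, x):
    return np.tanh(ae["W_e"] @ x)   # map input into the latent subspace

def decode(ae, z):
    return ae["W_d"] @ z            # reconstruct the input from the latent code

ae_v = make_autoencoder(d_v, d_z, rng)   # autoencoder for video frames
ae_s = make_autoencoder(d_s, d_z, rng)   # autoencoder for side information

# Paired features for one frame and its side information (random stand-ins
# for real visual and text features).
x_v = rng.standard_normal(d_v)
x_s = rng.standard_normal(d_s)

z_v, z_s = encode(ae_v, x_v), encode(ae_s, x_s)

# Feature reconstruction losses of the two uni-modal autoencoders.
rec_loss = (np.sum((decode(ae_v, z_v) - x_v) ** 2)
            + np.sum((decode(ae_s, z_s) - x_s) ** 2))

# Semantic relevance loss: distance between the paired latent codes.
rel_loss = np.sum((z_v - z_s) ** 2)

total_loss = rec_loss + rel_loss   # any trade-off weighting is omitted here

# Summarization step: rank candidate frames by their latent distance
# to the side-information embedding; the closest frames form the summary.
frames = rng.standard_normal((5, d_v))
dists = np.array([np.sum((encode(ae_v, f) - z_s) ** 2) for f in frames])
summary_order = np.argsort(dists)   # indices of frames, closest first
```

In the actual model the two losses would be minimized jointly over training pairs so that the latent codes of a frame and its associated text become directly comparable; this sketch only evaluates the objective once with random weights.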