Abstract

Automatic generation of natural language description for individual images (a.k.a. image captioning) has attracted extensive research attention. In this paper, we take one step further to investigate the generation of a paragraph to describe a photo stream for the purpose of storytelling. This task is even more challenging than individual image description due to the difficulty in modeling the large visual variance in an ordered photo collection and in preserving the long-term language coherence among multiple sentences. To deal with these challenges, we formulate the task as a sequence-to-sequence learning problem and propose a novel joint learning model by leveraging the semantic coherence in a photo stream. Specifically, to reduce visual variance, we learn a semantic space by jointly embedding each photo with its corresponding contextual sentence, so that the semantically related photos and their correlations are discovered. Then, to preserve language coherence in the paragraph, we learn a novel Bidirectional Attention-based Recurrent Neural Network (BARNN) model, which can attend on the discovered semantic relation to produce a sentence sequence and maintain its consistence with the photo stream. We integrate the two-step learning components into one single optimization formulation and train the network in an end-to-end manner. Experiments on three widely-used datasets (NYC/Disney/SIND) show that the proposed approach outperforms state-of-the-art methods with large margins for both retrieval and paragraph generation tasks. We also show the subjective preference of the machine-generated stories by the proposed approach over the baselines through a user study with 40 human subjects.