Abstract

Visual sentiment analysis aims to automatically recognize positive and negative emotions from images. The task faces three main challenges: large intra-class variance, fine-grained image categories, and scalability. Most existing methods focus on only one or two of these challenges, which limits their performance. In this paper, we propose a novel visual sentiment analysis approach with deep coupled adjective and noun neural networks. Specifically, to reduce the large intra-class variance, we first learn a shared middle-level sentiment representation by jointly training an adjective and a noun deep neural network with weak label supervision. Second, based on the learned sentiment representation, a prediction network is further optimized to handle the subtle differences that often exist among fine-grained image categories. The three networks are trained in an end-to-end manner, where the middle-level representations learned by the first two networks guide the sentiment network toward high performance and fast convergence. Third, when adjective and noun labels are unavailable, we generalize the training through mutual supervision between the learned adjective and noun networks via a Rectified Kullback-Leibler loss (\emph{ReKL}). Extensive experiments on two widely used datasets show that our method outperforms the state of the art on the SentiBank dataset with a $10.2\%$ accuracy gain and surpasses the previous best approach on the Twitter dataset by a clear margin.
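
The abstract does not define the ReKL loss; as a rough illustration only, the sketch below assumes a plausible form in which the per-class terms of the KL divergence are rectified at zero before summing, and mutual supervision is obtained by applying the loss symmetrically between the two networks' predictions. The function names (`rectified_kl`, `mutual_supervision_loss`) are hypothetical and the exact formulation in the paper may differ.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rectified_kl(p_logits, q_logits, eps=1e-8):
    # Hypothetical ReKL: compute the per-class terms p * log(p / q)
    # of the KL divergence, then rectify each term at zero so only
    # classes where p exceeds q contribute to the penalty.
    p = softmax(p_logits)
    q = softmax(q_logits)
    terms = p * (np.log(p + eps) - np.log(q + eps))
    return np.maximum(terms, 0.0).sum(axis=-1).mean()

def mutual_supervision_loss(adj_logits, noun_logits):
    # Symmetric application: each network's output supervises the other.
    return rectified_kl(adj_logits, noun_logits) + \
           rectified_kl(noun_logits, adj_logits)
```

Rectifying the per-term contributions (rather than the total) keeps the loss non-negative per class, which is one way a "rectified" KL could down-weight classes where the target already dominates the prediction.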