MSR-VTT: A Large Video Description Dataset for Bridging Video and Language [Supplementary Material]

  • Jun Xu
  • Tao Mei
  • Ting Yao
  • Yong Rui

Published in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

When organizing the Microsoft Research Video To Language challenge, we found that in our previously released dataset (CVPR 2016 paper), some sentences annotated by AMT workers were identical within a single video clip or very similar within a single category. To control the quality of the data and annotations, as well as of the competition, we removed those simple and duplicated sentences and replaced them with refined ones. We then released the corrected dataset on our challenge website. Because of these modifications, results on the corrected dataset do not match those reported in our CVPR paper. We therefore report the updated performance in the following tables, which correspond to the tables in our CVPR paper (referred to as Tables 1 to 7, respectively). If you are reproducing or comparing against the baselines on our MSR-VTT dataset, please refer to this supplementary material and the updated performance reported here. However, please still cite our CVPR paper if you use MSR-VTT as your dataset.