STAIR Captions

A Large-Scale Japanese Image Caption Dataset


In recent years, the automatic generation of image descriptions (captions), that is, image captioning, has attracted a great deal of attention. In this paper, we consider generating Japanese captions for images. Most studies on image captioning target the English language, and there are few image caption datasets in Japanese. To tackle this problem, we construct a large-scale Japanese image caption dataset based on images from MS-COCO. Our dataset consists of 820,310 Japanese captions for 164,062 images. In our experiment, we show that a neural network trained on our dataset generates more natural and better Japanese captions than those obtained by first generating English captions and then applying English-Japanese machine translation.



                    STAIR Captions       YJ Captions [1]
# of images         164,062 (123,287)    26,500
# of captions       820,310 (616,435)    131,730
Vocabulary size     35,642 (31,938)      13,274
Avg. # of chars.    23.79 (23.80)        23.23

The table summarizes the statistics of STAIR Captions. Numbers in brackets indicate the statistics of the public part of STAIR Captions. Compared with YJ Captions [1], our dataset contains 6.23x as many Japanese captions and 6.19x as many images. Even in the public part alone, the numbers of images and Japanese captions are 4.65x and 4.67x greater than those in YJ Captions, respectively. The large numbers of images and captions are important for image caption generation because they reduce the possibility of unknown scenes and objects appearing in test images. The vocabulary of our dataset is 2.69x larger than that of YJ Captions; thanks to this large vocabulary, a caption generation model can be expected to learn and generate more varied captions. The average number of characters per sentence is almost the same in the two datasets.

[1] Takashi Miyazaki and Nobuyuki Shimizu, "Cross-Lingual Image Caption Generation," Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
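Statistics like those in the table can be computed directly from the annotation file. Below is a minimal sketch, assuming the captions are distributed in MS-COCO-style JSON (an assumption; the field names `images`, `annotations`, and `caption` follow the MS-COCO annotation layout, and the file name is hypothetical):

```python
import json
from statistics import mean

def caption_stats(path):
    """Compute basic dataset statistics from an MS-COCO-style annotation file.

    Assumes the JSON contains "images" and "annotations" lists, where each
    annotation carries a "caption" string (MS-COCO-style layout).
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    captions = [ann["caption"] for ann in data["annotations"]]
    # Note: the word-level vocabulary size reported in the table requires a
    # Japanese morphological analyzer (e.g. MeCab); as a stand-in, this
    # sketch counts the set of distinct characters instead.
    char_vocab = {ch for cap in captions for ch in cap}
    return {
        "images": len(data["images"]),
        "captions": len(captions),
        "avg_chars": mean(len(cap) for cap in captions),
        "char_vocab": len(char_vocab),
    }

# Hypothetical usage:
# print(caption_stats("stair_captions_v1.2.json"))
```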


Japanese Image Caption Generation

We performed experiments in which Japanese image captions are generated by a neural network trained on STAIR Captions. The table below shows the performance of two methods: MS-COCO + MT and STAIR Captions. MS-COCO + MT first generates English captions for images with a neural network trained on the original MS-COCO dataset, and then translates them into Japanese using Google Translate. In contrast, STAIR Captions generates Japanese captions directly with a neural network trained on the STAIR Captions dataset. For details, see our papers.

                Bleu-1  Bleu-2  Bleu-3  Bleu-4  ROUGE_L  CIDEr
MS-COCO + MT    0.565   0.330   0.204   0.127   0.449    0.324
STAIR Captions  0.763   0.614   0.492   0.385   0.553    0.883
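For reference, BLEU-n scores a candidate caption by its modified n-gram precision against the reference captions. The following is a minimal sketch of modified unigram precision, the core of BLEU-1 (it omits the brevity penalty and higher-order n-grams used in the full metric; real evaluations typically use an existing toolkit):

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Modified unigram precision: each candidate token is credited at most
    as many times as it appears in any single reference (count clipping)."""
    cand_counts = Counter(candidate)
    # For each token, take its maximum count over all references.
    max_ref_counts = Counter()
    for ref in references:
        for tok, n in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], n)
    # Clip each candidate count by the reference maximum, then normalize.
    clipped = sum(min(n, max_ref_counts[tok]) for tok, n in cand_counts.items())
    return clipped / max(len(candidate), 1)
```

Clipping prevents a degenerate candidate that repeats one common word from scoring highly: `["the", "the", "the"]` against reference `["the", "cat"]` scores only 1/3, since "the" is credited once.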

Our Publications

  • Yuya Yoshikawa, Yutaro Shigeto, Akikazu Takeuchi, "STAIR Captions: Constructing Large-Scale Japanese Image Caption Dataset," Annual Meeting of the Association for Computational Linguistics (ACL), Short Paper, 2017. (to appear)
  • Yuya Yoshikawa, Yutaro Shigeto, Akikazu Takeuchi, "STAIR Captions: 大規模日本語画像キャプションデータセット (A Large-Scale Japanese Image Caption Dataset)," The 23rd Annual Meeting of the Association for Natural Language Processing (NLP2017), 2017. (In Japanese) [PDF]


If you use the STAIR Captions dataset, please cite the following paper.

  @inproceedings{yoshikawa2017stair,
    title={STAIR Captions: 大規模日本語画像キャプションデータセット},
    author={吉川友也 and 重藤優太郎 and 竹内彰一},
    booktitle={言語処理学会第23回年次大会 (NLP2017)},
    year={2017}
  }


Chiba Institute of Technology
Software Technology and Artificial Intelligence Research Laboratory (STAIR Lab)


  • Akikazu Takeuchi, STAIR Lab
  • Yuya Yoshikawa, STAIR Lab
  • Yutaro Shigeto, Nara Institute of Science and Technology (Internship)
  • Teruhisa Kamachi, Mokha Inc.
  • Hiroshi Ogino, Scaleout Inc.
  • Student volunteers at Chiba Institute of Technology, and about 2,100 crowd workers


竹内彰一 (Akikazu Takeuchi)