PRODUCTS

JNC & JAMUL

JNC, JAMUL, and JAMUL 2020 are corpora created from articles of The Asahi Shimbun for use in research on automatic headline generation and automatic summarization.

Introduction

Japanese page is here.

JNC

Through research on automatic headline generation, we learned that data pairing a headline with the very beginning of an article (about three sentences), rather than with the entire article, is useful, because newspaper articles often place the important information at the beginning.

JNC is a corpus of 1,828,231 cases, each pairing the first three sentences of an article published in The Asahi Shimbun over ten years (2007-2016) with its headline. Because JNC includes headlines of varying lengths, it is suitable as training data for headline generation models that also take output length into account.
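The pairing described above can be illustrated with a minimal sketch. It is not the procedure actually used to build JNC: the function names, the dictionary keys, and the rule of splitting sentences at the Japanese full stop "。" are all assumptions made for illustration.

```python
import re


def lead_sentences(article, n=3):
    """Return the first n sentences of a Japanese article as one string.

    Assumption: a sentence ends at the full stop "。". Real articles may
    need a more careful splitter (quotations, bullet marks, and so on).
    """
    # Each match is a run of non-"。" characters plus an optional closing "。".
    sentences = re.findall(r"[^。]+。?", article)
    return "".join(sentences[:n])


def make_pair(headline, article, n=3):
    """Build one hypothetical headline/lead record, keeping the headline length."""
    return {
        "headline": headline,
        "lead": lead_sentences(article, n),
        "headline_length": len(headline),  # counted in Unicode code points
    }


example = make_pair("見出し", "一文目。二文目。三文目。四文目。")
```

Storing the headline length next to each pair is one simple way to make the length available to a model that conditions on output length.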

The price of the corpus has been kept to a reasonable level by limiting the permitted uses and the amount of data.

To use this corpus, you need to sign a contract with us and pay a license fee. For more information, please contact us.

Sample

JAMUL

JAMUL contains 1,524 cases of articles transmitted between September 2017 and March 2018, along with the headlines that appeared in the newspaper and the headlines of 10, 13, and 26 characters in length that were attached to the articles when they were produced for various digital media. What makes this corpus unique is that every article comes with multiple headlines of varying lengths, all written by professional editors. We expect it to be used for the evaluation of automatic headline generation.
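As a rough illustration of length-constrained evaluation, the sketch below checks whether generated headlines fit the 10-, 13-, and 26-character limits mentioned above. The function names and the input layout are hypothetical, and counting one Unicode code point as one character is an assumption; the counting rule used by the editors may differ.

```python
from typing import Dict

# Character limits of the digital-media headlines described above.
LENGTH_BUDGETS = (10, 13, 26)


def fits_budget(headline: str, budget: int) -> bool:
    """True if the headline is no longer than the given character budget."""
    return len(headline) <= budget


def budget_report(candidates: Dict[int, str]) -> Dict[int, bool]:
    """Map each budget to whether its candidate headline fits within it.

    `candidates` maps a budget (for example 10) to a generated headline.
    """
    return {budget: fits_budget(text, budget) for budget, text in candidates.items()}


report = budget_report({10: "あ" * 10, 13: "あ" * 14, 26: "あ" * 20})
```

A length check like this only verifies the constraint; the quality of each headline would still need to be compared against the editor-written references of the same length.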

We can provide the JAMUL corpus free of charge if you agree to the terms of use. For more information, please contact us.

Sample

JAMUL (English version)

The English version of JAMUL was translated and developed from the Japanese version of JAMUL by the Okazaki Laboratory, Department of Computer Science, School of Computing, Tokyo Institute of Technology. The Asahi Shimbun Company does not guarantee its content or accuracy. All rights to JAMUL are reserved by The Asahi Shimbun, but the copyright to the English version of JAMUL belongs to the Tokyo Institute of Technology.

We can provide this corpus free of charge if you agree to the terms of use. For more information, please see this link or contact us.

Sample

JAMUL 2020

JAMUL 2020 contains 30,656 cases transmitted between May 2014 and June 2019 to the ANDES article summary service provided by The Asahi Shimbun. Each article has up to five types of headlines and summaries attached. Along with the four types of headlines stored in JAMUL, this corpus also contains the summaries (with a maximum length of 50 characters) that were transmitted to media such as electronic signboards. A distinguishing feature of this corpus is that the headlines, other than those that appeared in the print version, have different character-length limits. These differences arise from the device on which the headline is displayed or from the layout of the article. Because it contains more cases than JAMUL, it can be used for training as well as for evaluation.

To use this corpus, you need to sign a contract with us and pay a license fee. For more information, please contact us.

Sample

Publications

Yuta Hitomi, Yuya Taguchi, Hideaki Tamori, Ko Kikuta, Jiro Nishitoba, Naoaki Okazaki, Kentaro Inui, Manabu Okumura. A Large-Scale Multi-Length Headline Corpus for Analyzing Length-Constrained Headline Generation Model Evaluation. In Proceedings of the 12th International Conference on Natural Language Generation (INLG 2019), Tokyo, Japan, October 2019. [link]

The filter scripts we used in the paper above are here.

Sho Takase, Naoaki Okazaki. Multi-Task Learning for Cross-Lingual Abstractive Summarization. arXiv:2010.07503. October 2020. [link]

How to obtain

Please email us:

mrad-contact(atmark)asahi.com