The JNC is useful for training headline generation models, and the JAMUL is a corpus for evaluating headline generation models, containing news articles and their headlines of 10, 13, and 26 characters for digital media.
The Japanese version of this page is here.
The JNC is a collection of 1,829,231 pairs of the three lead sentences of articles and their print headlines published from 2007 to 2016. Headline lengths in the JNC vary widely because of factors related to newspaper publishing (e.g., space limitations and the importance of the news); in general, more important articles tend to receive longer print headlines.
The JNC is useful for training headline generation models because it contains many training instances. Furthermore, the variety of headline lengths makes the corpus suitable for training variable-length headline generation models.
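One common way to exploit such varied headline lengths is length-controlled generation, where the desired output length is prepended to the source text as a pseudo-token. The sketch below illustrates this idea; the field names and the JSON Lines layout are illustrative assumptions, not the corpus's actual distribution format.

```python
import json

def make_training_instance(lead_sentences, headline):
    """Build a source/target pair for length-controlled headline generation.

    The desired headline length (in characters) is prepended to the lead
    sentences as a pseudo-token such as "<len16>", so a model can learn to
    generate headlines of a requested length. This is a sketch of a common
    technique, not the preprocessing used by the corpus authors.
    """
    length_token = f"<len{len(headline)}>"
    source = length_token + " " + " ".join(lead_sentences)
    return {"source": source, "target": headline}

instance = make_training_instance(
    ["First lead sentence.", "Second lead sentence."],
    "Example headline",
)
print(json.dumps(instance, ensure_ascii=False))
```

At training time, one such instance would be written per line of a JSON Lines file; at inference time, the length token lets the same model target any of several headline lengths.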
You need to sign a license agreement with us and pay the license fee to use this corpus. For more information, please contact us.
The JAMUL is a corpus containing 1,524 news articles and their three headlines of 10 characters, 13 characters, and 26 characters for digital media. All the articles and headlines were published between September 2017 and March 2018.
The volume of news articles may be insufficient for training a headline generation model. However, the headlines in the JAMUL strictly satisfy the length requirements, and this novel characteristic makes the JAMUL well suited as a test set for length-constrained headline generation. No articles overlap between the JNC and the JAMUL.
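The strict length requirements can be checked mechanically. The sketch below verifies that each digital headline fits its character limit (10, 13, or 26 characters); the dict layout and the sample headlines are invented for illustration and are not taken from the corpus.

```python
# Character limits for the three digital headline variants described above.
# The variant names and the dict-based layout are illustrative assumptions.
LIMITS = {"short": 10, "medium": 13, "long": 26}

def check_limits(headlines):
    """Return the names of variants whose headline exceeds its limit."""
    return [name for name, text in headlines.items()
            if len(text) > LIMITS[name]]

# Made-up example headlines, not actual JAMUL data.
sample = {
    "short": "首相が辞意表明へ",
    "medium": "首相が辞意を表明する見通し",
    "long": "首相が近く辞意を表明する見通し、後継選びが本格化へ",
}
print(check_limits(sample))  # an empty list means all limits are respected
```

Note that `len()` on a Python string counts Unicode code points, which matches the character-based limits used for Japanese headlines here.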
Yuta Hitomi, Yuya Taguchi, Hideaki Tamori, Ko Kikuta, Jiro Nishitoba, Naoaki Okazaki, Kentaro Inui, Manabu Okumura. A Large-Scale Multi-Length Headline Corpus for Analyzing Length-Constrained Headline Generation Model Evaluation. In Proceedings of the 12th International Conference on Natural Language Generation (INLG 2019), Tokyo, Japan, October 2019. [link]
The filter scripts we used in the paper above are available here.
Please email us: