The SJNC (Simplified Japanese News Corpus) is a corpus developed for research on text simplification, based on articles from The Asahi Shimbun Company.
It contains approximately 7,000 sentence pairs (original and simplified sentences) drawn from around 700 articles. Simplification follows guidelines written specifically for this corpus, so that the original information is retained while the text becomes simpler.
The SJNC (Simplified Japanese News Corpus) is a corpus developed for research on Japanese text simplification, based on articles from The Asahi Shimbun Company (2022).
Text simplification is a task that involves converting an input sentence into a simpler expression while retaining its original meaning.
Existing simplification corpora and system-generated outputs are known to often include examples that are unfaithful to the original text, for instance by adding unrelated information or omitting important details. In response, we are constructing a Japanese text simplification corpus that remains faithful to the original sentences.
This data was manually constructed according to guidelines designed with careful attention to the trade-off between readability and faithfulness to the original text. We have confirmed that this corpus preserves the information in the original sentences better than existing Japanese simplification corpora. It has also been shown to be effective for generating simplified sentences that stay faithful to the original text, both as training data for Seq2Seq models and in few-shot methods for large language models (LLMs).
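As a rough illustration of the few-shot setting mentioned above, the sketch below assembles a prompt from (original, simplified) sentence pairs. The function name `build_few_shot_prompt`, the instruction wording, and the placeholder pairs are all assumptions made for illustration; they are not the prompt used in the paper, and they do not reflect the corpus's actual distribution format.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], source: str) -> str:
    """Build a few-shot simplification prompt from (original, simplified) pairs.

    Hypothetical helper: the instruction text and layout are illustrative only.
    """
    lines = [
        "Rewrite each sentence in simpler language "
        "without adding or omitting information.",
        "",
    ]
    # Each demonstration shows an original sentence and its simplification.
    for original, simplified in examples:
        lines.append(f"Original: {original}")
        lines.append(f"Simplified: {simplified}")
        lines.append("")
    # The final entry is left open for the model to complete.
    lines.append(f"Original: {source}")
    lines.append("Simplified:")
    return "\n".join(lines)


# Placeholder pairs, NOT taken from SJNC; substitute real corpus pairs here.
examples = [
    ("<original sentence 1>", "<simplified sentence 1>"),
    ("<original sentence 2>", "<simplified sentence 2>"),
]
prompt = build_few_shot_prompt(examples, "<new sentence to simplify>")
print(prompt)
```

The resulting string can be passed to any LLM completion interface; how many pairs to include and how to select them are left open here.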
SJNC v1.0 is now available for distribution. (2024.9.27)
Toru Urakawa, Yuya Taguchi, Takuro Niitsuma, Hideaki Tamori. A Japanese News Simplification Corpus with Faithfulness. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, May 2024.
Please email us:
mrad-contact(atmark)asahi.com