hr500k – A Reference Training Corpus of Croatian

Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec

Abstract

In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also give a description of the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.

Publication type

Conference paper

Publication

Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT‑DH 2018), Ljubljana, Slovenia, pp. 154-161

Date

September 2018

Links

PDF, Slides, Dataset, CLARIN repository, NoSketch Engine interface, KonText interface