SETimes.SR – A Reference Training Corpus of Serbian

Vuk Batanović, Nikola Ljubešić, Tanja Samardžić

Abstract

In this paper we present SETimes.SR – a gold-standard dataset for Serbian, annotated for document, sentence, and token segmentation, morphosyntax, lemmas, dependency syntax, and named entities. We describe the annotation layers, provide a basic statistical overview of them, and discuss how they are encoded in the CoNLL and TEI formats. In addition, we compare the SETimes.SR corpus with the older SETimes.HR dataset for Croatian.

Publication type

Conference paper

Publication

Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT‑DH 2018), Ljubljana, Slovenia, pp. 11-17

Date

September 2018

Links

PDF, Slides, Dataset, CLARIN repository, NoSketch Engine interface, KonText interface