SETimes.SR – A Reference Training Corpus of Serbian

Abstract

In this paper we present SETimes.SR, a gold-standard dataset for Serbian annotated for document, sentence, and token segmentation, morphosyntax, lemmas, dependency syntax, and named entities. We describe the annotation layers, provide a basic statistical overview of them, and discuss how they are encoded in the CoNLL and TEI formats. In addition, we compare the SETimes.SR corpus with the older Croatian SETimes.HR dataset.

Publication
Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT‑DH 2018), Ljubljana, Slovenia, pp. 11-17
Date