SETimes.SR – A Reference Training Corpus of Serbian


In this paper we present SETimes.SR, a gold-standard dataset for Serbian annotated for document, sentence, and token segmentation, morphosyntax, lemmas, dependency syntax, and named entities. We describe the annotation layers, provide a basic statistical overview of each, and discuss how they are encoded in the CoNLL and TEI formats. In addition, we compare the SETimes.SR corpus with the older Croatian SETimes.HR dataset.

Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT‑DH 2018), Ljubljana, Slovenia, pp. 11-17