Lemma | Vuk Batanović

SETimes.SR reference training corpus of Serbian

The SETimes.SR reference training corpus of Serbian consists of 87 thousand tokens, or close to four thousand sentences, gathered from the (now defunct) Southeast European Times news portal. Each news story is treated as a separate document and is segmented into sentences and tokens. The entire corpus is annotated at the levels of lemmas and parts of speech, morphosyntax, syntactic dependencies, and named entities. The construction of the corpus is described in a JT-DH 2018 paper.
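
To make the annotation layers concrete, here is a minimal Python sketch of how one sentence carrying these layers could be read. It rests on several assumptions that the description above does not confirm: that the data is stored in a CoNLL-U-style tab-separated layout, and that named entities appear as a `NER=` key in the last column. The sample sentence, the document and sentence ids, and the names `Token` and `parse_sentences` are hypothetical and are not taken from the corpus.

```python
from dataclasses import dataclass

# Hypothetical sample in a CoNLL-U-style layout (an assumption, not the
# confirmed distribution format). The sentence is invented for illustration.
SAMPLE = (
    "# newdoc id = example-doc\n"
    "# sent_id = example-doc.1\n"
    "# text = Beograd je grad.\n"
    "1\tBeograd\tBeograd\tPROPN\t_\tCase=Nom|Gender=Masc|Number=Sing\t3\tnsubj\t_\tNER=B-LOC\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\tNER=O\n"
    "3\tgrad\tgrad\tNOUN\t_\tCase=Nom|Gender=Masc|Number=Sing\t0\troot\t_\tNER=O\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\tNER=O\n"
    "\n"
)


@dataclass
class Token:
    form: str     # surface token
    lemma: str    # lemma layer
    upos: str     # part-of-speech layer
    feats: str    # morphosyntactic features
    head: int     # index of the syntactic head (0 = root)
    deprel: str   # dependency relation to the head
    ner: str      # named-entity tag (assumed to live in the last column)


def parse_sentences(text):
    """Split CoNLL-U-style text into sentences, each a list of Token objects."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry document and sentence metadata; skipped here.
            continue
        cols = line.split("\t")
        # The last column holds key=value pairs separated by "|".
        misc = dict(item.split("=", 1) for item in cols[9].split("|") if "=" in item)
        current.append(
            Token(
                form=cols[1],
                lemma=cols[2],
                upos=cols[3],
                feats=cols[5],
                head=int(cols[6]),
                deprel=cols[7],
                ner=misc.get("NER", "O"),
            )
        )
    if current:
        sentences.append(current)
    return sentences


for token in parse_sentences(SAMPLE)[0]:
    print(token.form, token.lemma, token.upos, token.head, token.deprel, token.ner)
```

Each `Token` bundles the lemma, part-of-speech, morphosyntactic, dependency, and named-entity layers for one token, so a consumer can pick whichever layer it needs without re-reading the file.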