SETimes.SR: A reference training corpus of Serbian
This repository contains the SETimes.SR reference training corpus of Serbian, which has been annotated on the following levels:
- Document, sentence, and token segmentation
- Parts of speech, morphosyntactic descriptors (MSDs), and morphosyntactic features, according to the MULTEXT-East v5 (MTEv5) specification
- Syntactic dependency relations, parts of speech, and morphosyntactic features, according to the Universal Dependencies v2 (UDv2) specification
- Named entities, according to the IOB2 standard
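In IOB2, the first token of every entity is tagged `B-<type>`, each following token of the same entity is tagged `I-<type>`, and tokens outside any entity are tagged `O`. The following is a minimal sketch of how such a tag sequence could be grouped into entity spans. The function name `iob2_to_spans` and the labels `per` and `loc` are illustrative assumptions and are not taken from the corpus documentation.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB2 tags into (start, end_exclusive, entity_type) spans.

    Assumption: an "I-" tag that does not continue an open entity of the
    same type is malformed under IOB2 and is treated here as outside.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and start is not None and kind == label:
            continue  # the currently open entity continues
        if start is not None:
            # Any other tag closes the open entity.
            spans.append((start, i, label))
            start, label = None, None
        if prefix == "B":
            start, label = i, kind
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical tag sequence for a five-token sentence.
example_tags = ["B-per", "I-per", "O", "B-loc", "O"]
example_spans = iob2_to_spans(example_tags)
```

Because every entity starts with a `B-` tag, two adjacent entities of the same type remain distinguishable, which is the main difference from the older IOB1 scheme.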
The texts within SETimes.SR were taken from the SETimes parallel corpus, a multilingual parallel collection of news stories from the now-defunct Southeast European Times news portal. Document segmentation was introduced in a previous version of the SETimes.SR repository. The sentence and token segmentation, lemma, and MSD layers were taken from the Regional Linguistic Data Initiative repository. The UDv2 dependency relation layer was taken from the UD Serbian repository. Both the MTEv5 and the UD morphosyntactic features and POS tags were semi-automatically generated using the MTE-UD mapping and code available in this repository. The same mapping was also applied to the hr500k corpus of Croatian. The named entity annotation is newly added.
The SETimes.SR corpus file is structured according to the CoNLL file format standard, with the following distribution of information across columns:
- Token index in the sentence
- MTE part-of-speech tag
- MTE morphosyntactic descriptor
- MTE morphosyntactic features
- _ (left empty to preserve formatting equivalence with the hr500k corpus, which contains older, non-UD dependency relation tags in this position)
- UD dependency relation (head:label)
- UD specific features (used to encode the SpaceAfter attribute)
- Named entity tag
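A file laid out this way can be read with a few lines of code. The sketch below assumes the usual CoNLL conventions, which are not confirmed by this description: tab-separated columns, one token per line, a blank line between sentences, and metadata lines starting with `#`. The helper names are hypothetical. Exact column positions should be checked against the corpus file itself, so the helpers operate on field values rather than on fixed column indices.

```python
from typing import Iterable, Iterator, List, Tuple


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of rows, each row a list of column strings."""
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line is assumed to end the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Assumed metadata line (e.g. a document or sentence id); skipped.
            continue
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def split_dependency(field: str) -> Tuple[int, str]:
    """Split a "head:label" field into (head_index, label).

    Only the first colon is used, so labels with subtypes such as
    "flat:name" stay intact.
    """
    head, _, label = field.partition(":")
    return int(head), label


def detokenize(tokens: List[str], misc_fields: List[str]) -> str:
    """Rebuild the sentence text, omitting the space after SpaceAfter=No tokens."""
    parts = []
    for token, misc in zip(tokens, misc_fields):
        parts.append(token)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()
```

For example, `split_dependency("0:root")` yields the head index `0` and the label `root`, and the last element of each row returned by `read_sentences` would be the named entity tag if the columns follow the order listed above.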
The SETimes.SR corpus is also available in several different formats on the Slovenian Research Infrastructure CLARIN.SI repository.
If you wish to use the SETimes.SR corpus in your paper or project, please cite the following paper and resource references:
SETimes.SR – A Reference Training Corpus of Serbian, Vuk Batanović, Nikola Ljubešić, Tanja Samardžić, in Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 11-17, Ljubljana, Slovenia (2018).
Batanović, Vuk; Ljubešić, Nikola; Samardžić, Tanja and Erjavec, Tomaž, 2018, Training corpus SETimes.SR 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1200.
Work on the SETimes.SR corpus was supported by the Regional Linguistic Data Initiative (ReLDI) via the Swiss National Science Foundation grant no. 160501, and the Slovenian Research Infrastructure CLARIN.SI.