SETimes.SR

A Reference Training Corpus of Serbian

SETimes.SR: A reference training corpus of Serbian

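This repository contains the SETimes.SR reference training corpus of Serbian, which has been annotated on the following levels: sentence and token segmentation, lemmas, MTE and UD morphosyntactic tags and features, UD dependency relations, and named entities.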

Annotation sources

The texts in SETimes.SR were taken from the SETimes parallel corpus, a multilingual parallel collection of news stories from the now-defunct Southeast European Times news portal. Document segmentation was introduced in a previous version of the SETimes.SR repository. The sentence and token segmentation, lemma, and MSD layers were taken from the Regional Linguistic Data Initiative repository. The UDv2 dependency relation layer was taken from the UD Serbian repository. Both the MTEv5 and the UD morphosyntactic features and POS tags were generated semi-automatically using the MTE-UD mapping and code available in this repository. The same mapping was also applied to the Croatian hr500k corpus. The named entity annotation is newly added.

Structure

The SETimes.SR corpus file is structured according to the CoNLL file format standard, with the following distribution of information across columns:

  1. Token index in the sentence
  2. Token
  3. Lemma
  4. MTE part-of-speech tag
  5. MTE morphosyntactic descriptor
  6. MTE morphosyntactic features
  7. _ (left empty to preserve formatting equivalence with the hr500k corpus, which contains older, non-UD dependency relation tags in this position)
  8. UD dependency relation (head:label)
  9. UD specific features (used to encode the SpaceAfter attribute)
  10. Named entity tag
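
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the corpus. It assumes that columns are tab-separated, that sentences are separated by blank lines, and that any line starting with `#` is a comment. The names `Token`, `parse_sentences`, and `detokenize` are hypothetical, and the sample rows are made up for illustration rather than copied from the corpus file.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

N_COLUMNS = 10  # number of columns listed in the Structure section


@dataclass
class Token:
    index: int             # column 1: token index in the sentence
    form: str              # column 2: token
    lemma: str             # column 3: lemma
    mte_pos: str           # column 4: MTE part-of-speech tag
    mte_msd: str           # column 5: MTE morphosyntactic descriptor
    mte_feats: str         # column 6: MTE morphosyntactic features
    head: Optional[int]    # column 8, part before the first colon
    deprel: Optional[str]  # column 8, part after the first colon
    misc: str              # column 9: UD specific features (SpaceAfter)
    ner: str               # column 10: named entity tag

    @property
    def space_after(self) -> bool:
        # Assumption: the attribute is written as "SpaceAfter=No",
        # possibly among other "|"-separated features.
        return "SpaceAfter=No" not in self.misc.split("|")


def parse_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():          # blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # assumed comment / metadata line
            continue
        cols = line.split("\t")
        if len(cols) != N_COLUMNS:
            raise ValueError(f"expected {N_COLUMNS} columns, got {len(cols)}")
        # Split on the first colon only, so that labels which themselves
        # contain a colon (e.g. "flat:name") stay intact.
        head_str, _, label = cols[7].partition(":")
        head = int(head_str) if head_str.isdigit() else None
        # Column 7 (cols[6]) is left empty in SETimes.SR and is skipped.
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], head, label or None,
                              cols[8], cols[9]))
    if sentence:                      # file may not end with a blank line
        yield sentence


def detokenize(sentence: List[Token]) -> str:
    """Rebuild the surface text using the SpaceAfter attribute."""
    text = "".join(t.form + (" " if t.space_after else "") for t in sentence)
    return text.rstrip()


# Made-up rows for illustration only; tag values are placeholders.
SAMPLE_ROWS = [
    ["1", "Ovo", "ovaj", "Pron", "Pd-nsn", "_", "_", "3:nsubj", "_", "O"],
    ["2", "je", "biti", "Aux", "Var3s", "_", "_", "3:cop", "_", "O"],
    ["3", "Beograd", "Beograd", "Propn", "Npmsn", "_", "_", "0:root",
     "SpaceAfter=No", "B-loc"],
    ["4", ".", ".", "Punct", "Z", "_", "_", "3:punct", "_", "O"],
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n"

if __name__ == "__main__":
    for sent in parse_sentences(SAMPLE.splitlines()):
        print(detokenize(sent))
        for tok in sent:
            print(tok.index, tok.form, tok.head, tok.deprel, tok.ner)
```

To read the real corpus, pass an open file handle to `parse_sentences` instead of the sample lines. If the file uses a different separator or comment convention than assumed here, adjust the `split` and `startswith` checks accordingly.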

Other formats

The SETimes.SR corpus is also available in several different formats on the Slovenian Research Infrastructure CLARIN.SI repository:

References

If you wish to use the SETimes.SR corpus in your paper or project, please cite the following paper and resource references:

Acknowledgement

Work on the SETimes.SR corpus was supported by the Regional Linguistic Data Initiative (ReLDI) via the Swiss National Science Foundation grant no. 160501, and the Slovenian Research Infrastructure CLARIN.SI.

License

Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)