Serbian AutoRIA - a model for automating the RIA mechanism for Serbian

Rapid Integrated Assessment (RIA) is a national policy document evaluation mechanism developed by the UNDP to help countries assess their readiness for the implementation of the UN Sustainable Development Goals (SDGs). The created model automates the RIA procedure for documents written in Serbian and is based on an earlier IBM approach developed for English. The model works by searching the documents for sentences or paragraphs that are a semantic match for one of the SDG targets. The model repository also contains the Serbian national policy documents, as well as their stemmed versions. Further information can be found in the LT4All paper.
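The matching step described above can be illustrated with a minimal sketch. This is not the AutoRIA model itself: it swaps in plain bag-of-words cosine similarity as a stand-in for whatever semantic matching the model uses, and the function names, the threshold value, and the example texts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on word characters (a stand-in for real stemming)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_targets(sentences, targets, threshold=0.3):
    """Return (target_id, sentence, score) triples scoring at or above threshold.

    `targets` maps a hypothetical target id to its text; `threshold` is an
    arbitrary illustrative cut-off, not a value taken from the model.
    """
    target_vecs = {tid: Counter(tokenize(text)) for tid, text in targets.items()}
    matches = []
    for sentence in sentences:
        vec = Counter(tokenize(sentence))
        for tid, target_vec in target_vecs.items():
            score = cosine(vec, target_vec)
            if score >= threshold:
                matches.append((tid, sentence, score))
    # Best matches first
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Placeholder texts for illustration only
targets = {"T1": "end poverty in all its forms"}
sentences = [
    "The strategy aims to end poverty in rural areas",
    "Road maintenance budget",
]
matches = match_targets(sentences, targets)
```

In this toy run only the first sentence is matched to `T1`. A real pipeline for Serbian would presumably stem the text first (the repository ships stemmed versions of the documents) and use a stronger similarity measure.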

SETimes.SR reference training corpus of Serbian

The SETimes.SR reference training corpus of Serbian consists of 87 thousand tokens, or close to four thousand sentences, gathered from the (now defunct) Southeast European Times news portal. Each news story is treated as a separate document and is segmented into sentences and tokens. The entire corpus is annotated at the level of lemmas, parts of speech, morphosyntax, syntactic dependencies, and named entities. The construction of this corpus is described in a JT-DH 2018 paper.
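The document, sentence, and token layering described above could be modelled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical, the tag values are illustrative placeholders, and nothing here reflects the corpus's actual file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    form: str          # surface form as it appears in the text
    lemma: str         # lemma annotation
    pos: str           # part-of-speech tag
    msd: str           # morphosyntactic description
    head: int          # 1-based index of the syntactic head, 0 for the root
    deprel: str        # dependency relation to the head
    ner: str = "O"     # named-entity label, "O" when outside any entity


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str        # one news story per document
    sentences: List[Sentence] = field(default_factory=list)


def corpus_size(documents):
    """Return (number of sentences, number of tokens) over all documents."""
    n_sent = sum(len(d.sentences) for d in documents)
    n_tok = sum(len(s.tokens) for d in documents for s in d.sentences)
    return n_sent, n_tok


# Toy two-token sentence; tag values are placeholders, not corpus data
doc = Document(
    doc_id="example-1",
    sentences=[
        Sentence(
            tokens=[
                Token("Beograd", "Beograd", "PROPN", "Npmsn", 2, "nsubj", "B-loc"),
                Token("raste", "rasti", "VERB", "Vmr3s", 0, "root"),
            ]
        )
    ],
)
size = corpus_size([doc])
```

Applied to the full corpus, a count of this kind would correspond to the roughly four thousand sentences and 87 thousand tokens stated above.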

SCStemmers – A collection of stemmers for Serbian and Croatian

SCStemmers is a package containing four stemming algorithms for Serbian and Croatian:
– The greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka,
– A refinement of their greedy stemmer for Serbian, by Nikola Milošević,
– A stemmer for Croatian, by Nikola Ljubešić and Ivan Pandžić.
SCStemmers can be used as a standalone tool or as a plug-in for Weka. The package was presented in the LREC 2016 paper.
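For readers unfamiliar with stemming, the general idea of greedy suffix stripping can be sketched as below. This is not the rule set or algorithm of any stemmer in the package: the suffix list, the minimum stem length, and the function name are assumptions made up for illustration.

```python
# Toy suffix list, invented for this example (not taken from SCStemmers)
SUFFIXES = ("ovima", "ama", "ima", "om", "a", "e", "i", "u")


def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least `min_stem` characters."""
    w = word.lower()
    # Try longer suffixes first so that "ovima" wins over "ima" or "a"
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


stems = [stem(w) for w in ("gradovima", "gradom", "grad", "oka")]
# stems == ["grad", "grad", "grad", "oka"]
```

The minimum stem length keeps short words such as "oka" from being reduced to an uninformative fragment. The stemmers in the package rely on their own, far more complete rule sets.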