Algorithm

Serbian AutoRIA - a model for automating the RIA mechanism for Serbian

Rapid Integrated Assessment (RIA) is a national policy document evaluation mechanism developed by the UNDP to help countries assess their readiness for the implementation of UN Sustainable Development Goals (SDG). The created model automates the RIA procedure for documents written in Serbian and is based on an earlier IBM approach developed for English. The model works by searching the documents for sentences / paragraphs that are a semantic match for one the SDG targets. The model repository also contains the Serbian national policy documents, as well as their stemmed versions. Further information can be found in the LT4All paper.

STSFineGrain – a collection of semantic textual similarity models

STSFineGrain is a Java package that contains a collection of semantic textual similarity models and a framework for their evaluation on STS corpora with fine-grained similarity scores. Seven different STS models are implemented, including three unsupervised and four supervised models. Among the supervised models there are both previously presented algorithms, such as LInSTSS and POST STSS, as well as the new POS-TF STSS model that outperforms them. Evaluation can be performed either on an entire dataset, or via cross-validation on it. STSFineGrain currently supports POST STSS and POS-TF STSS models for texts in Serbian and in English. Other models have no such language-related restrictions. This package was presented in the LREC 2018 paper.

SCStemmers – A collection of stemmers for Serbian and Croatian

SCStemmers is a package containing four stemming algorithms for Serbian and Croatian:
– The greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka,
– A refinement of their greedy stemmer for Serbian, by Nikola Milošević,
– A stemmer for Croatian, by Nikola Ljubešić and Ivan Pandžić.
SCStemmers can be used as a standalone tool or as a plug-in for Weka. The package was presented in the LREC 2016 paper.

NBSVM-Weka – a multiclass implementation of the NBSVM classifier for Weka

NBSVM is an algorithm, originally designed for binary text/sentiment classification, which combines the Multinomial Naive Bayes (MNB) classifier with the Support Vector Machine (SVM). It does so through the element-wise multiplication of standard SVM feature vectors by the positive class/negative class ratios of MNB log-counts.
This implementation extends the original algorithm to support multiclass classification using the one-vs-all approach. It relies on the LIBLINEAR library and its Java wrapper and is designed as a package for Weka. NBSVM-Weka was presented in the LREC 2016 paper.

Part-of-speech tag-supported short-text semantic similarity (POST STSS)

POST STSS is a method of computing short-text semantic similarity (i.e. semantic textual similarity) that uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to their parts of speech. The optimal POS weights are determined using an incremental, hill climbing-based technique. The only language-specific resource POST STSS requires is a part-of-speech tagger (and optionally a lemmatizer), making it applicable to most languages. Further information about the algorithm can be found in the 2015 ComSIS paper. POST STSS is implemented within the STSFineGrain package.

Language-independent Short-Text Semantic Similarity (LInSTSS)

LInSTSS is a method of computing short-text semantic similarity (i.e. semantic textual similarity) that uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to word frequencies. Since it does not use any language-specific tools or resouces, LInSTSS is easily applicable to any language. Further information about the algorithm can be found in the 2013 Decision Support Systems paper. LInSTSS is implemented within the STSFineGrain package.