My field of expertise is natural language processing, particularly semantic tasks like semantic similarity and sentiment analysis. My research also involves other problems, such as distributional semantics, coreference resolution, text classification, and the impact of morphological normalization on semantic tasks. One of my main points of interest is dealing with the particularities of short-text processing. In addition, I am focused on creating solutions which are easily applicable not only to English, but to other, less prominent languages as well.
I work as a researcher at the Innovation Center of the School of Electrical Engineering, University of Belgrade, Serbia. I am also a co-founder and the vice president of the Regional Linguistic Data Initiative Centre in Belgrade, an NGO dedicated to developing and promoting language resources and technologies, as well as organizing seminars and tutorials regarding their use. In addition, I am the lead NLP engineer at Bravo Systems, where I head a team of engineers and linguists in developing NLP solutions for the digital advertising industry.
PhD in Software Engineering, 2020
School of Electrical Engineering, University of Belgrade
Master's degree in Computer Science and Information Technology, 2011
School of Electrical Engineering, University of Belgrade
Bachelor's degree in Computer Science and Information Technology, 2010
School of Electrical Engineering, University of Belgrade
The SentiComments.SR dataset includes the following three corpora of short texts annotated for the task of sentiment analysis:
The main SentiComments.SR corpus, consisting of 3490 movie-related comments;
The movie verification corpus, consisting of 464 movie-related comments;
The book verification corpus, consisting of 173 book-related comments.
Six sentiment labels were used in dataset annotation: +1, -1, +M, -M, +NS, and -NS, with the addition of an ‘s’ label suffix denoting the presence of sarcasm. The main corpus was annotated by two annotators working together, and therefore contains a single, unified sentiment label for each comment. The verification corpora were used to evaluate the quality, efficiency, and cost-effectiveness of the annotation framework, which is why they contain separate sentiment labels for six annotators. The construction of this dataset is described in the 2020 PLoS ONE paper.
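The label scheme above can be illustrated with a small parser. This is only a sketch: it assumes that the sarcasm marker is written as a lowercase 's' appended directly to one of the six base labels (for example "+1s"), which may differ from how the dataset files actually serialize it. The names `SentimentLabel` and `parse_label` are hypothetical and not part of the dataset.

```python
from dataclasses import dataclass

# The six base labels named in the dataset description.
BASE_LABELS = {"+1", "-1", "+M", "-M", "+NS", "-NS"}


@dataclass(frozen=True)
class SentimentLabel:
    base: str        # one of the six base labels
    sarcastic: bool  # True when the 's' suffix is present


def parse_label(raw: str) -> SentimentLabel:
    """Split a raw label into its base label and a sarcasm flag.

    Assumption: sarcasm is marked by a trailing lowercase 's'.
    The check is case-sensitive, so the uppercase 'S' in '+NS'
    is not mistaken for the suffix.
    """
    text = raw.strip()
    sarcastic = text.endswith("s")
    base = text[:-1] if sarcastic else text
    if base not in BASE_LABELS:
        raise ValueError(f"unknown sentiment label: {raw!r}")
    return SentimentLabel(base, sarcastic)
```

Keeping the sarcasm flag separate from the base label makes it easy to train or evaluate on either dimension independently.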
Rapid Integrated Assessment (RIA) is a national policy document evaluation mechanism developed by the UNDP to help countries assess their readiness for the implementation of UN Sustainable Development Goals (SDG). The created model automates the RIA procedure for documents written in Serbian and is based on an earlier IBM approach developed for English. The model works by searching the documents for sentences or paragraphs that are a semantic match for one of the SDG targets. The model repository also contains the Serbian national policy documents, as well as their stemmed versions. Further information can be found in the LT4All paper.
The SETimes.SR reference training corpus of Serbian consists of 87 thousand tokens, or close to four thousand sentences, gathered from the (now defunct) Southeast European Times news portal. Each news story is treated as a separate document and is segmented into sentences and tokens. The entire corpus is annotated on the level of lemmas and parts of speech, morphosyntax, syntactic dependencies, and named entities. The construction of this corpus is described in a JT-DH 2018 paper.
STSFineGrain is a Java package that contains a collection of semantic textual similarity models and a framework for their evaluation on STS corpora with fine-grained similarity scores. Seven different STS models are implemented, including three unsupervised and four supervised models. The supervised models include previously presented algorithms, such as LInSTSS and POST STSS, as well as the new POS-TF STSS model, which outperforms them. Evaluation can be performed either on an entire dataset or via cross-validation on it. STSFineGrain currently supports the POST STSS and POS-TF STSS models for texts in Serbian and in English. The other models have no such language-related restrictions. This package was presented in the LREC 2018 paper.
The Serbian Semantic Textual Similarity News Corpus – STS.news.sr (ISLRN 146-979-597-345-4) consists of 1192 pairs of sentences in Serbian gathered from news sources on the web. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0–5 scale. The final scores were obtained by averaging the individual scores of five annotators. The construction of this corpus is described in the LREC 2018 paper.
STSAnno is a tool written in Java for offline semantic textual similarity (STS) annotation. It allows the user/annotator to assign and change semantic similarity scores of text/sentence pairs in a given corpus. This tool was presented in the LREC 2018 paper.
The Serbian Movie Review Dataset (SerbMR) collection consists of three movie review datasets in Serbian which were constructed for the task of sentiment analysis:
Collected movie reviews in Serbian (ISLRN 252-457-966-231-5) – an unbalanced collection of 4725 movie reviews in Serbian.
SerbMR-2C – The Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1) – a two-class balanced sentiment analysis dataset containing 1682 movie reviews in Serbian (841 positive and 841 negative reviews).
SerbMR-3C – The Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0) – a three-class balanced sentiment analysis dataset containing 2523 movie reviews in Serbian (841 positive, 841 neutral, and 841 negative reviews).
The construction of this dataset collection is described in the LREC 2016 paper.
SCStemmers is a package containing four stemming algorithms for Serbian and Croatian:
– The greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka,
– A refinement of their greedy stemmer for Serbian, by Nikola Milošević,
– A stemmer for Croatian, by Nikola Ljubešić and Ivan Pandžić.
SCStemmers can be used as a standalone tool or as a plug-in for Weka. The package was presented in the LREC 2016 paper.
NBSVM is an algorithm, originally designed for binary text/sentiment classification, which combines the Multinomial Naive Bayes (MNB) classifier with the Support Vector Machine (SVM). It does so through the element-wise multiplication of standard SVM feature vectors by the positive class/negative class ratios of MNB log-counts.
This implementation extends the original algorithm to support multiclass classification using the one-vs-all approach. It relies on the LIBLINEAR library and its Java wrapper and is designed as a package for Weka. NBSVM-Weka was presented in the LREC 2016 paper.
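The feature scaling step described above can be sketched in a few lines. The following is a minimal illustration, not the NBSVM-Weka code: it computes the smoothed log-count ratio between the positive and the negative class and multiplies each feature vector by it element-wise. The function names, the smoothing parameter `alpha`, and the toy count matrix are assumptions made for this example; the scaled vectors would then be passed to a linear SVM, which is omitted here.

```python
import math


def log_count_ratio(X, y, alpha=1.0):
    """Return the per-feature log-count ratio for a binary problem.

    X is a list of feature count vectors and y a list of labels
    (1 = positive, 0 = negative). alpha is an additive smoothing
    term (an assumed default, not taken from the package).
    """
    n_features = len(X[0])
    p = [alpha] * n_features  # smoothed counts in the positive class
    q = [alpha] * n_features  # smoothed counts in the negative class
    for row, label in zip(X, y):
        target = p if label == 1 else q
        for j, value in enumerate(row):
            target[j] += value
    p_total, q_total = sum(p), sum(q)
    # Ratio of the L1-normalized class count vectors, in log space.
    return [math.log((p[j] / p_total) / (q[j] / q_total))
            for j in range(n_features)]


def scale_features(X, r):
    """Multiply every feature vector element-wise by the ratio vector r."""
    return [[value * r_j for value, r_j in zip(row, r)] for row in X]


# Toy data: three features, two positive and two negative documents.
X = [[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
r = log_count_ratio(X, y)
X_scaled = scale_features(X, r)
```

In a one-vs-all setting, the same ratio would be computed once per class, treating that class as positive and all remaining classes as negative.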
POST STSS is a method of computing short-text semantic similarity (i.e. semantic textual similarity) that uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to their parts of speech. The optimal POS weights are determined using an incremental, hill climbing-based technique. The only language-specific resource POST STSS requires is a part-of-speech tagger (and optionally a lemmatizer), making it applicable to most languages. Further information about the algorithm can be found in the 2015 ComSIS paper. POST STSS is implemented within the STSFineGrain package.
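The idea of tuning POS weights by incremental hill climbing can be sketched as follows. This is a generic illustration under stated assumptions, not the procedure from the ComSIS paper: the fixed step size, the stopping rule, the tag names, and the toy objective are all hypothetical. In practice, the objective would measure how well the similarity scores produced with a given set of weights agree with the training annotations.

```python
def hill_climb(weights, objective, step=0.1, max_rounds=100):
    """Greedy coordinate-wise hill climbing over a dict of POS weights.

    In each round, every weight is nudged up and down by `step`, and a
    change is kept only if it improves the objective. The search stops
    when a full round brings no improvement.
    """
    best = dict(weights)
    best_score = objective(best)
    for _ in range(max_rounds):
        improved = False
        for tag in list(best):
            for delta in (step, -step):
                candidate = dict(best)
                # Weights are kept non-negative (an assumption).
                candidate[tag] = max(0.0, candidate[tag] + delta)
                score = objective(candidate)
                if score > best_score:
                    best, best_score, improved = candidate, score, True
        if not improved:
            break
    return best, best_score


# Toy objective (hypothetical): closeness to a hidden "ideal" weighting.
TARGET = {"NOUN": 1.5, "VERB": 1.0, "ADJ": 0.5}


def toy_objective(w):
    return -sum((w[tag] - TARGET[tag]) ** 2 for tag in TARGET)


start = {"NOUN": 1.0, "VERB": 1.0, "ADJ": 1.0}
tuned, score = hill_climb(start, toy_objective, step=0.25)
```

Hill climbing of this kind finds a local optimum only, so the result can depend on the starting weights and the step size.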
LInSTSS is a method of computing short-text semantic similarity (i.e. semantic textual similarity) that uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to word frequencies. Since it does not use any language-specific tools or resources, LInSTSS is easily applicable to any language. Further information about the algorithm can be found in the 2013 Decision Support Systems paper. LInSTSS is implemented within the STSFineGrain package.
The Serbian Paraphrase Corpus – paraphrase.sr (ISLRN 192-200-046-033-9) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The construction of this corpus is described in the 2011 TELFOR paper and the 2013 Decision Support Systems paper.
This project of the Research Centre of the Slovenian Academy of Sciences and Arts investigates the links in conceptions of language and nation in the post-Yugoslav space, spanning six states (Slovenia, Croatia, Serbia, Bosnia & Herzegovina, Montenegro, Macedonia), and zooming in on news media texts and connected social-media citizen discourses. On this project, I have coordinated the collection, curation and publishing of specialized corpora of news media texts and digital citizen comments focused on the topic of language. Such corpora have been constructed using a standardized methodology across multiple languages, including Serbian, Croatian, and Slovenian.
Advancing Novel Textual Similarity-based Solutions in Software Development (AVANTES) is a two-year project supported by the Science Fund of the Republic of Serbia which aims to develop various natural language processing (NLP) tools and techniques for use in software development. The main research question the project tackles is the relationship between programming code semantics and the meaning of code comments written in a natural language. Within the scope of the project, several NLP tasks are to be considered, including code comment categorization according to a comment type taxonomy, comment pair similarity, calculated using cross-level semantic similarity methods, and semantic code searching. In addition, the project will focus on the identification of various types of code clones. All of these research goals are to be addressed across multiple programming (C/C++/C#, Java, JavaScript, PHP, Python, SQL) and natural languages (English and Serbian). On this project, I am tasked with overseeing and leading the development of NLP tools and annotated datasets for NLP problems.
The European research infrastructure CLARIN enables researchers to access language resources and tools for computational processing of European languages. Within the CLARIN project, I am working on the consolidation and extension of morphosyntactic, syntactic, named entity, and semantic role label annotation layers in Croatian and Serbian corpora published on the CLARIN.SI repository. I am also involved in the work of the CLARIN Knowledge Centre for South Slavic languages (CLASSLA), specifically on its web services.
Rapid Integrated Assessment (RIA) is a national policy document evaluation mechanism developed by the UNDP to help countries assess their readiness for the implementation of UN Sustainable Development Goals. The aim of this project was to automate the RIA procedure for documents written in Serbian, based on an earlier UNDP/IBM pilot project for English. The project was proposed by UN Country Team Serbia and funded via the 2018 call for innovation proposals by the UN Development Operations Coordination Office (UNDOCO). Implementation was performed in cooperation with the SeConS Development Initiative Group.
Regional Linguistic Data Initiative – ReLDI (SNSF SCOPES project 160501) was a two-year institutional partnership between research units in Switzerland, Serbia and Croatia. As a research collaborator, I participated in the creation, distribution and analysis of linguistic/NLP datasets and tools for Serbian and Croatian. ReLDI Centre Belgrade was founded after the conclusion of the project in order to continue the activities of this partnership.
(with prof. dr Boško Nikolić)