Education

University education

  • 2012 - present – PhD in natural language processing, Department of Software Engineering, School of Electrical Engineering, University of Belgrade, GPA 10/10
    Planned thesis: A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources
  • 2010 - 2011 – Master’s degree, Department of Computer Science and Information Technology, School of Electrical Engineering, University of Belgrade, GPA 10/10
    Thesis: An expert system for determining the semantic similarity of short texts in Serbian
  • 2006 - 2010 – Bachelor’s degree, Department of Computer Science and Information Technology, School of Electrical Engineering, University of Belgrade, GPA 9.56/10
    Thesis: A visual simulator of search algorithms

Summer schools and seminars

  • MLSS 2018 – Machine Learning Summer School 2018, Universidad Autónoma de Madrid, Spain
  • ESSLLI 2018 – 30th European Summer School in Logic, Language and Information, Sofia University “St. Kl. Ohridski”, Bulgaria
  • DS3 2018 – Second Data Science Summer School, École Polytechnique, Paris, France
  • DeepLearn 2017 – International Summer School on Deep Learning 2017, University of Deusto, Rovira i Virgili University, Bilbao, Spain
  • ESSLLI 2016 – 28th European Summer School in Logic, Language and Information, Free University of Bozen-Bolzano, Italy
  • LxMLS 2016 – 6th Lisbon Machine Learning Summer School, Instituto Superior Técnico, Portugal
  • ReLDI (Regional Linguistic Data Initiative) seminars at the Faculty of Philology, University of Belgrade, Serbia, and the Faculty of Philosophy, University of Zagreb, Croatia, 2016-2017

Online courses

  • Natural Language Processing, Stanford University, Coursera
  • Natural Language Processing, Columbia University, Coursera
  • Machine Learning, Stanford University, Coursera
  • Introduction to Natural Language Processing, University of Michigan, Coursera
  • Miracles of Human Language: An Introduction to Linguistics, Universiteit Leiden, Coursera
  • Text Retrieval and Search Engines, University of Illinois at Urbana-Champaign, Coursera
  • The Data Scientist’s Toolbox, Johns Hopkins University, Coursera
  • Data Mining with Weka, University of Waikato
  • More Data Mining with Weka, University of Waikato
  • Advanced Data Mining with Weka, University of Waikato

Selected Publications

In this paper we present SETimes.SR – a gold standard dataset for Serbian, annotated with regard to document, sentence, and token segmentation, morphosyntax, lemmas, dependency syntax, and named entities. We describe the annotation layers and provide a basic statistical overview of them, and we discuss the method of encoding them in the CoNLL and the TEI format. In addition, we compare the SETimes.SR corpus with the older SETimes.HR dataset in Croatian.
In JT-DH, 2018.

Although the task of semantic textual similarity (STS) has gained in prominence in the last few years, annotated STS datasets for model training and evaluation, particularly those with fine-grained similarity scores, remain scarce for languages other than English, and practically non-existent for minor ones. In this paper, we present the Serbian Semantic Textual Similarity News Corpus (STS.news.sr) – an STS dataset for Serbian that contains 1192 sentence pairs annotated with fine-grained semantic similarity scores. We describe the process of its creation and annotation, and we analyze and compare our corpus with the existing news-based STS datasets in English and other major languages. Several existing STS models are evaluated on the Serbian STS News Corpus, and a new supervised bag-of-words model that combines part-of-speech weighting with term frequency weighting is proposed and shown to outperform similar methods. Since Serbian is a morphologically rich language, the effect of various morphological normalization tools on STS model performances is considered as well. The Serbian STS News Corpus, the annotation tool and guidelines used in its creation, and the STS model framework used in the evaluation are all made publicly available.
In LREC, 2018.

Sentiment classification of texts written in Serbian is still an under-researched topic. One of the open issues is how the different forms of morphological normalization affect the performances of different sentiment classifiers and which normalization procedure is optimal for this task. In this paper we assess and compare the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset.
In TELFOR, 2016.

Collecting data for sentiment analysis in resource-limited languages carries a significant risk of sample selection bias, since the small quantities of available data are most likely not representative of the whole population. Ignoring this bias leads to less robust machine learning classifiers and less reliable evaluation results. In this paper we present a dataset balancing algorithm that minimizes the sample selection bias by eliminating irrelevant systematic differences between the sentiment classes. We prove its superiority over the random sampling method and we use it to create the Serbian movie review dataset – SerbMR – the first balanced and topically uniform sentiment analysis dataset in Serbian. In addition, we propose an incremental way of finding the optimal combination of simple text processing options and machine learning features for sentiment classification. Several popular classifiers are used in conjunction with this evaluation approach in order to establish strong but reliable baselines for sentiment analysis in Serbian.
In LREC, 2016.

This paper presents POST STSS, a method of determining short-text semantic similarity in which part-of-speech tags are used as indicators of the deeper syntactic information usually extracted by more advanced tools like parsers and semantic role labelers. Our model employs a part-of-speech weighting scheme and is based on a statistical bag-of-words approach. It does not require either hand-crafted knowledge bases or advanced syntactic tools, which makes it easily applicable to languages with limited natural language processing resources. By using a paraphrase recognition test, we demonstrate that our system achieves a higher accuracy than all existing statistical similarity algorithms and solutions of a more structural kind.
In ComSIS, 2015.

Measuring the semantic similarity of short texts is a noteworthy problem since short texts are widely used on the Internet, in the form of product descriptions or captions, image and webpage tags, news headlines, etc. This paper describes a methodology which can be used to create a software system capable of determining the semantic similarity of two given short texts. The proposed LInSTSS approach is particularly suitable for application in situations when no large, publicly available, electronic linguistic resources can be found for the desired language. We describe the basic working principles of the system architecture we propose, as well as the stages of its construction and use. Also, we explain the procedure used to generate a paraphrase corpus which is then utilized in the evaluation process. Finally, we analyze the evaluation results obtained from a system created for the Serbian language, and we discuss possible improvements which would increase system accuracy.
In Decision Support Systems, 2013.

Publication List

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset

Using Part-of-Speech Tags as Deep-Syntax Indicators in Determining Short-Text Semantic Similarity

Evaluation and Classification of Syntax Usage in Determining Short-Text Semantic Similarity

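Evaluacija i klasifikacija korišćenja sintaksnih informacija u određivanju semantičke sličnosti kratkih tekstova (in Serbian; English: Evaluation and classification of the use of syntactic information in determining the semantic similarity of short texts)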

Semantic similarity of short texts in languages with a deficient natural language processing support

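Softverski sistem za određivanje semantičke sličnosti kratkih tekstova na srpskom jeziku (in Serbian; English: A software system for determining the semantic similarity of short texts in Serbian)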

Softverski sistem za učenje ekspertskih sistema (in Serbian; English: A software system for learning expert systems)

Created Datasets and Tools

Serbian AutoRIA - a model for automating the RIA mechanism for Serbian

Rapid Integrated Assessment (RIA) is a national policy document evaluation mechanism developed by the UNDP to help countries assess their readiness for the implementation of UN Sustainable Development Goals (SDG). The created model automates the RIA procedure for documents written in Serbian and is based on an earlier IBM approach developed for English. The model works by searching the documents for sentences/paragraphs that are a semantic match for one of the SDG targets. The model repository also contains the Serbian national policy documents, as well as their stemmed versions.

SETimes.SR reference training corpus of Serbian

The SETimes.SR reference training corpus of Serbian consists of 87 thousand tokens, or close to four thousand sentences, gathered from the (now defunct) Southeast European Times news portal. Each news story is treated as a separate document and is segmented into sentences and tokens. The entire corpus is annotated at the level of lemmas and parts of speech, morphosyntax, syntactic dependencies, and named entities. The construction of this corpus is described in a JT-DH 2018 paper.

STSFineGrain – a collection of semantic textual similarity models

STSFineGrain is a Java package that contains a collection of semantic textual similarity models and a framework for their evaluation on STS corpora with fine-grained similarity scores. Seven different STS models are implemented: three unsupervised and four supervised. The supervised models include previously presented algorithms, such as LInSTSS and POST STSS, as well as the new POS-TF STSS model, which outperforms them. Evaluation can be performed either on an entire dataset, or via cross-validation on it. STSFineGrain currently supports the POST STSS and POS-TF STSS models only for texts in Serbian and in English. The other models have no such language-related restrictions. This package was presented in the LREC 2018 paper.

The Serbian STS News Corpus (STS.news.sr)

The Serbian Semantic Textual Similarity News Corpus – STS.news.sr (ISLRN 146-979-597-345-4) consists of 1192 pairs of sentences in Serbian gathered from news sources on the web. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0–5 scale. The final scores were obtained by averaging the individual scores of five annotators. The construction of this corpus is described in the LREC 2018 paper.
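The score aggregation described above can be illustrated with a minimal Python sketch. The function name and the example scores are hypothetical and are not taken from the corpus or its tooling. The only assumptions carried over from the description are the 0–5 scale and the plain arithmetic mean over the annotators' scores.

```python
def final_similarity_score(annotator_scores, low=0.0, high=5.0):
    """Average the individual scores assigned to one sentence pair.

    Hypothetical helper: it checks that every score lies on the
    0-5 scale and returns the arithmetic mean as a float.
    """
    if not annotator_scores:
        raise ValueError("at least one annotator score is required")
    for score in annotator_scores:
        if not low <= score <= high:
            raise ValueError(f"score {score} is outside the {low}-{high} scale")
    return sum(annotator_scores) / len(annotator_scores)


# Made-up scores from five annotators for a single sentence pair.
pair_scores = [3, 4, 4, 5, 4]
print(final_similarity_score(pair_scores))
```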

STSAnno – a tool for semantic textual similarity annotation

STSAnno is a tool written in Java for offline semantic textual similarity (STS) annotation. It allows the user/annotator to assign and change semantic similarity scores of text/sentence pairs in a given corpus. This tool was presented in the LREC 2018 paper.

The Serbian Movie Review Dataset (SerbMR)

The Serbian Movie Review Dataset (SerbMR) collection consists of three movie review datasets in Serbian which were constructed for the task of sentiment analysis:
– Collected movie reviews in Serbian (ISLRN 252-457-966-231-5) – an unbalanced collection of 4725 movie reviews in Serbian.
– SerbMR-2C – The Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1) – a two-class balanced sentiment analysis dataset containing 1682 movie reviews in Serbian (841 positive and 841 negative reviews).
– SerbMR-3C – The Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0) – a three-class balanced sentiment analysis dataset containing 2523 movie reviews in Serbian (841 positive, 841 neutral, and 841 negative reviews).
The construction of this dataset collection is described in the LREC 2016 paper.

SCStemmers – A collection of stemmers for Serbian and Croatian

SCStemmers is a package containing four stemming algorithms for Serbian and Croatian:
– The greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka,
– A refinement of their greedy stemmer for Serbian, by Nikola Milošević,
– A stemmer for Croatian, by Nikola Ljubešić and Ivan Pandžić.
SCStemmers can be used as a standalone tool or as a plug-in for Weka. The package was presented in the LREC 2016 paper.

NBSVM-Weka – a multiclass implementation of the NBSVM classifier for Weka

NBSVM is an algorithm, originally designed for binary text/sentiment classification, which combines the Multinomial Naive Bayes (MNB) classifier with the Support Vector Machine (SVM). It does so through the element-wise multiplication of standard SVM feature vectors by the MNB log-count ratios of the positive and the negative class.
This implementation extends the original algorithm to support multiclass classification using the one-vs-all approach. It relies on the LIBLINEAR library and its Java wrapper and is designed as a package for Weka. NBSVM-Weka was presented in the LREC 2016 paper.
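The feature scaling step at the core of NBSVM can be sketched in a few lines of Python. This is only an illustration for the binary case and is not the NBSVM-Weka code, which is a Java package built on LIBLINEAR. The function names, the add-alpha smoothing value, and the toy data are assumptions made for this example. The scaled vectors would then be passed to a linear SVM, which is omitted here.

```python
import math


def log_count_ratio(pos_docs, neg_docs, vocab_size, alpha=1.0):
    """Return r = log((p / |p|_1) / (q / |q|_1)) for each feature.

    p and q are smoothed feature counts summed over the positive and
    negative training documents (alpha is an assumed smoothing value).
    """
    p = [alpha] * vocab_size
    q = [alpha] * vocab_size
    for doc in pos_docs:
        for i, value in enumerate(doc):
            p[i] += value
    for doc in neg_docs:
        for i, value in enumerate(doc):
            q[i] += value
    p_norm, q_norm = sum(p), sum(q)
    return [math.log((p[i] / p_norm) / (q[i] / q_norm))
            for i in range(vocab_size)]


def scale_features(doc, ratios):
    """Element-wise product of a feature vector and the ratios."""
    return [x * r for x, r in zip(doc, ratios)]


# Hypothetical toy data over the vocabulary ["good", "bad", "movie"].
pos_docs = [[1, 0, 1], [1, 0, 1]]
neg_docs = [[0, 1, 1], [0, 1, 1]]
ratios = log_count_ratio(pos_docs, neg_docs, vocab_size=3)
scaled = scale_features([1, 1, 1], ratios)
```

In this toy example the ratio is positive for "good", negative for "bad", and zero for "movie", which occurs equally often in both classes.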

Part-of-speech tag-supported short-text semantic similarity (POST STSS)

POST STSS is a method of computing short-text semantic similarity (i.e. semantic textual similarity) that uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to their parts of speech. The optimal POS weights are determined using an incremental, hill climbing-based technique. The only language-specific resource POST STSS requires is a part-of-speech tagger (and optionally a lemmatizer), making it applicable to most languages. Further information about the algorithm can be found in the 2015 ComSIS paper. POST STSS is implemented within the STSFineGrain package.
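The idea of tuning POS weights by incremental hill climbing can be illustrated with a generic Python sketch. It is not the procedure from the ComSIS paper or the STSFineGrain code. The tag names, step size, starting weights, and especially the toy objective are assumptions. In a real setting the objective would measure model performance on training data for a given set of weights.

```python
def hill_climb_weights(tags, objective, step=0.1, max_rounds=100):
    """Coordinate-wise hill climbing over one weight per POS tag.

    Starting from all weights equal to 1.0, each round tries to raise
    or lower every weight by `step` and keeps a change only if it
    improves the objective. The search stops when a full round brings
    no improvement.
    """
    weights = {tag: 1.0 for tag in tags}
    best = objective(weights)
    for _ in range(max_rounds):
        improved = False
        for tag in tags:
            for delta in (step, -step):
                candidate = dict(weights)
                # Rounding keeps the weights on a clean 0.1 grid.
                candidate[tag] = max(0.0, round(candidate[tag] + delta, 10))
                score = objective(candidate)
                if score > best:
                    weights, best, improved = candidate, score, True
        if not improved:
            break
    return weights, best


# Toy objective (hypothetical): closeness to made-up "ideal" weights.
ideal = {"NOUN": 1.5, "VERB": 1.0, "ADP": 0.2}


def toy_objective(weights):
    return -sum((weights[tag] - ideal[tag]) ** 2 for tag in ideal)


tuned, score = hill_climb_weights(list(ideal), toy_objective)
```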

Language-independent Short-Text Semantic Similarity (LInSTSS)

LInSTSS is a method of computing short-text semantic similarity (i.e. semantic textual similarity) that uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to word frequencies. Since it does not use any language-specific tools or resources, LInSTSS is easily applicable to any language. Further information about the algorithm can be found in the 2013 Decision Support Systems paper. LInSTSS is implemented within the STSFineGrain package.

The Serbian Paraphrase Corpus (paraphrase.sr)

The Serbian Paraphrase Corpus – paraphrase.sr (ISLRN 192-200-046-033-9) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The construction of this corpus is described in the 2011 TELFOR paper and the 2013 Decision Support Systems paper.

Research Projects

CLARIN - Common Language Resources and Technology Infrastructure

The European research infrastructure CLARIN enables researchers to access language resources and tools for computational processing of European languages. I am working within the CLARIN project on the consolidation and extension of morphosyntactic, syntactic, named entity, and semantic role label annotation layers in Croatian and Serbian corpora which are published on the CLARIN.SI repository. I am also involved in the CLARIN Knowledge Centre for South Slavic languages (CLASSLA), specifically in its web services.

Automating the Rapid Integrated Assessment mechanism in Serbian

Rapid Integrated Assessment (RIA) is a national policy document evaluation mechanism developed by the UNDP to help countries assess their readiness for the implementation of UN Sustainable Development Goals. The aim of this project was to automate the RIA procedure for documents written in Serbian, based on an earlier UNDP/IBM pilot project for English. The project was proposed by UN Country Team Serbia and funded via the 2018 call for innovation proposals by the UN Development Operations Coordination Office (UNDOCO). Implementation was performed in cooperation with the SeConS Development Initiative Group.

Regional Linguistic Data Initiative (ReLDI)

Regional Linguistic Data Initiative – ReLDI (SNSF SCOPES project 160501) was a two-year institutional partnership between research units in Switzerland, Serbia and Croatia. As a research collaborator, I participated in the creation, distribution and analysis of linguistic/NLP datasets and tools for Serbian and Croatian. ReLDI Centre Belgrade was founded after the conclusion of the project in order to continue the activities of this partnership.

Open Information Extraction for the Slovenian and the Serbian Language

Open Information Extraction for the Slovenian and the Serbian Language was a two-year bilateral project between the Faculty of Computer and Information Science, University of Ljubljana and the School of Electrical Engineering, University of Belgrade. As a researcher on the project, I was tasked with the creation of the first dataset in Serbian annotated with coreference relations and with its subsequent use in the construction of the first coreference resolution system for Serbian. I also worked on the same task for Croatian.

Teaching

(with prof. dr Boško Nikolić)

  • School year 2017/2018 - present – Created the teaching materials, gave lectures and practical demonstrations, and supervised student projects within the new Natural Language Processing course at the Software Engineering master’s degree study program of the School of Electrical Engineering, University of Belgrade.
  • School year 2017/2018 - present – Created a part of the teaching materials and gave a part of the lectures and practical demonstrations within the Data Mining course at the Computer Science and Information Technology master’s degree study program of the School of Electrical Engineering, University of Belgrade.
  • School year 2016/2017 - present – Created the teaching materials, gave lectures and practical demonstrations, and supervised student projects within the new Machine Learning course at the Intelligent Systems PhD study program of the University of Belgrade.
  • School year 2015/2016 - present – Supervised several bachelor’s degree and master’s degree theses in NLP/ML of students at the School of Electrical Engineering, University of Belgrade.

Skills

Programming Languages

  • Python
  • Java
  • C++
  • C#
  • C
  • Matlab
  • SQL

ML/NLP Tools and Frameworks

Annotation Tools

Language Proficiency

  • Serbian (native)
  • English (fluent – C2, Cambridge Certificate of Proficiency in English (CPE), grade A)
  • French (limited)

Other Information

Awards, Grants and Scholarships

  • MLSS grant for MLSS 2018 participation
  • EACL grant for ESSLLI 2018 participation
  • TELFOR 2016 Blažo Mirčevski award for the best paper by a young author
  • Jožef Stefan Institute/CLARIN project grant for the consolidation and enlargement of language resources in Croatian and Serbian
  • ReLDI project grant for the creation of language resources in Serbian and Croatian
  • Elsevier grant for ESSLLI 2016 participation
  • 2010 Scholarship of the Fund for Young Talents of the Republic of Serbia

Membership in Professional Organizations

  • ACL SIGSLAV – Association for Computational Linguistics Special Interest Group on Slavic Natural Language Processing

Peer-review Activities

Contact