Vuk Batanović

Research Associate

Innovation Center, School of Electrical Engineering, University of Belgrade

Co-founder and Vice President

Regional Linguistic Data Initiative Centre

Lead NLP Engineer

Bravo Systems / Oddbytes

Biography

My field of expertise is natural language processing, particularly semantic tasks like semantic similarity and sentiment analysis. My research also involves other problems, such as named entity recognition, text classification, coreference resolution, and the impact of morphological normalization on semantic tasks. One of my main points of interest is dealing with the particularities of short-text processing. In addition, I am focused on creating solutions which are easily applicable not only to English, but to other, less prominent languages as well.

I work as a research associate at the Innovation Center of the School of Electrical Engineering, University of Belgrade, Serbia. I am also a co-founder and the vice president of the Regional Linguistic Data Initiative Centre in Belgrade, an NGO dedicated to developing and promoting language resources and technologies, as well as organizing seminars and tutorials regarding their use. In addition, I am the lead NLP engineer at Bravo Systems / Oddbytes, where I head a team of engineers and linguists in developing NLP solutions for the digital advertising industry.

Interests

Natural Language Processing / Computational Linguistics
Sentiment Analysis
Semantic Similarity
Named Entity Recognition
Short-Text Processing
Morphological Normalization
NLP for Minor Languages
Multilingual NLP Solutions
Coreference Resolution
(Deep) Machine Learning

Education

PhD in Software Engineering, 2020
School of Electrical Engineering, University of Belgrade

Master's degree in Computer Science and Information Technology, 2011
School of Electrical Engineering, University of Belgrade

Bachelor's degree in Computer Science and Information Technology, 2010
School of Electrical Engineering, University of Belgrade

Education

University education

2012 - 2020 – PhD in Software Engineering, School of Electrical Engineering, University of Belgrade, GPA 10/10
PhD thesis: A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources
2010 - 2011 – Master’s degree in Computer Science and Information Technology, School of Electrical Engineering, University of Belgrade, GPA 10/10
Thesis: An expert system for determining the semantic similarity of short texts in Serbian
2006 - 2010 – Bachelor’s degree in Computer Science and Information Technology, School of Electrical Engineering, University of Belgrade, GPA 9.56/10
Thesis: A visual simulator of search algorithms

Summer schools and seminars

AthNLP 2019 – 1st Athens Natural Language Processing Summer School, NCSR “Demokritos”, Greece
LAMBDA Big Data Analytics Summer School 2019, Mihajlo Pupin Institute, Belgrade, Serbia
MLSS 2018 – Machine Learning Summer School 2018, Universidad Autónoma de Madrid, Spain
ESSLLI 2018 – 30th European Summer School in Logic, Language and Information, Sofia University “St. Kl. Ohridski”, Bulgaria
DS³ 2018 – Second Data Science Summer School, École Polytechnique, Paris, France
DeepLearn 2017 – International Summer School on Deep Learning 2017, University of Deusto, Rovira i Virgili University, Bilbao, Spain
ESSLLI 2016 – 28th European Summer School in Logic, Language and Information, Free University of Bozen-Bolzano, Italy
LxMLS 2016 – 6th Lisbon Machine Learning Summer School, Instituto Superior Técnico, Portugal
ReLDI (Regional Linguistic Data Initiative) seminars at the Faculty of Philology, University of Belgrade, Serbia, and the Faculty of Philosophy, University of Zagreb, Croatia, 2016-2017

Selected Publications

Monolingual, multilingual and cross-lingual code comment classification

Code comments are one of the most useful forms of documentation and metadata for understanding software implementation. Previous research on code comment classification has focused only on comments in English, typically extracted from a few programming languages. This paper addresses the problem of code comment classification not only in the monolingual setting, but also in the multilingual and cross-lingual one, in order to examine whether they can outperform the traditional monolingual approach. To tackle this task, we introduce a novel, publicly available code comment dataset, consisting of over 10,000 code comments collected from software projects written in eight programming languages (C, C++, C#, Java, JavaScript/TypeScript, PHP, Python, and SQL). About half of them are written in Serbian while the other half are written in English. This dataset was manually annotated according to a newly proposed taxonomy of code comment categories. We fine-tune and evaluate multiple monolingual and multilingual pre-trained neural language models on the code comment classification task and compare their performances to several baselines. The best results for Serbian comments are obtained using the monolingual neural model BERTić, trained on Serbian and closely related languages. On the other hand, the optimal choice for English is the multilingual neural model multilingual BERT, which successfully extracts useful patterns from data in both languages. Although the cross-lingual setting shows some promise for simple binary classification, it has yet to reach sufficiently high performance levels for practical use. We also analyze model performance across different programming languages.

Marija Kostić, Vuk Batanović, Boško Nikolić

In EAAI, 2023.

Details PDF Code Dataset ReLDI tokenizer for Serbian Stemmers for Serbian and Croatian Lemmatizer for Serbian FastText word embeddings for Serbian (Serbian web corpus srWaC) FastText word embeddings for Serbian (Common Crawl) FastText word embeddings for English BERTić LLM for Serbian ELECTRA LLM for English Multilingual BERT LLM XLM-RoBERTa LLM

A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources

Statistical approaches to natural language processing typically require considerable amounts of labeled data, and often various auxiliary language tools as well, limiting their applicability in resource-limited settings. This thesis presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources. In these languages, not only are existing language resources limited, but so are the capabilities for developing new datasets and dedicated tools and algorithms. The proposed methodology focuses on short texts due to their prevalence in digital communication, as well as the greater complexity regarding their semantic processing.
The methodology encompasses all phases in the creation of statistical solutions, from the collection of textual content, to data annotation, to the formulation, training, and evaluation of machine learning models. Its use is illustrated in detail on two semantic tasks – sentiment analysis and semantic textual similarity. The Serbian language is utilized as an example of a language with limited resources, but the proposed methodology can also be applied to other languages in this category.
In addition to the general methodology, the contributions of this thesis consist of the development of a new, flexible short-text sentiment annotation system, a new annotation cost-effectiveness metric, as well as several new semantic textual similarity models. The thesis results also include the creation of the first publicly available annotated datasets of short texts in Serbian for the tasks of sentiment analysis and semantic textual similarity, the development and evaluation of numerous models on these tasks, and the first comparative evaluation of multiple morphological normalization tools on short texts in Serbian.

Vuk Batanović

PhD Thesis, University of Belgrade - School of Electrical Engineering, 2020.

Details PDF Official Repository STS.news.sr corpus SentiComments.SR dataset Stemmers for Serbian and Croatian STSFineGrain package STSAnno tool

A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts

Choosing a comprehensive and cost-effective way of articulating and annotating the sentiment of a text is not a trivial task, particularly when dealing with short texts, in which sentiment can be expressed through a wide variety of linguistic and rhetorical phenomena. This problem is especially conspicuous in resource-limited settings and languages, where design options are restricted either in terms of manpower and financial means required to produce appropriate sentiment analysis resources, or in terms of available language tools, or both. In this paper, we present a versatile approach to addressing this issue, based on multiple interpretations of sentiment labels that encode information regarding the polarity, subjectivity, and ambiguity of a text, as well as the presence of sarcasm or a mixture of sentiments. We demonstrate its use on Serbian, a resource-limited language, via the creation of a main sentiment analysis dataset focused on movie comments, and two smaller datasets belonging to the movie and book domains. In addition to measuring the quality of the annotation process, we propose a novel metric to validate its cost-effectiveness. Finally, the practicality of our approach is further validated by training, evaluating, and determining the optimal configurations of several different kinds of machine-learning models on a range of sentiment classification tasks using the produced dataset.

Vuk Batanović, Miloš Cvetanović, Boško Nikolić

In PLoS ONE, 2020.

Details PDF Code Dataset Serbian web corpus srWaC ReLDI tokenizer for Serbian Stemmers for Serbian and Croatian BTagger for Serbian HunPos and CST models for Croatian ReLDI tagger and lemmatizer for Serbian and Croatian

SETimes.SR – A Reference Training Corpus of Serbian

In this paper we present SETimes.SR – a gold standard dataset for Serbian, annotated with regard to document, sentence, and token segmentation, morphosyntax, lemmas, dependency syntax, and named entities. We describe the annotation layers and provide a basic statistical overview of them, and we discuss the method of encoding them in the CoNLL and the TEI format. In addition, we compare the SETimes.SR corpus with the older SETimes.HR dataset in Croatian.

Vuk Batanović, Nikola Ljubešić, Tanja Samardžić

In JT-DH, 2018.

Details PDF Slides Dataset CLARIN repository NoSketch Engine interface KonText interface

Fine-grained Semantic Textual Similarity for Serbian

Although the task of semantic textual similarity (STS) has gained in prominence in the last few years, annotated STS datasets for model training and evaluation, particularly those with fine-grained similarity scores, remain scarce for languages other than English, and practically non-existent for minor ones. In this paper, we present the Serbian Semantic Textual Similarity News Corpus (STS.news.sr) – an STS dataset for Serbian that contains 1192 sentence pairs annotated with fine-grained semantic similarity scores. We describe the process of its creation and annotation, and we analyze and compare our corpus with the existing news-based STS datasets in English and other major languages. Several existing STS models are evaluated on the Serbian STS News Corpus, and a new supervised bag-of-words model that combines part-of-speech weighting with term frequency weighting is proposed and shown to outperform similar methods. Since Serbian is a morphologically rich language, the effect of various morphological normalization tools on STS model performances is considered as well. The Serbian STS News Corpus, the annotation tool and guidelines used in its creation, and the STS model framework used in the evaluation are all made publicly available.

Vuk Batanović, Miloš Cvetanović, Boško Nikolić

In LREC, 2018.

Details PDF Code Dataset STSAnno annotation tool STS annotation guidelines Serbian web corpus srWaC ReLDI tokenizer for Serbian Stemmers for Serbian and Croatian BTagger for Serbian HunPos and CST models for Croatian ReLDI tagger and lemmatizer for Serbian and Croatian

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset

Collecting data for sentiment analysis in resource-limited languages carries a significant risk of sample selection bias, since the small quantities of available data are most likely not representative of the whole population. Ignoring this bias leads to less robust machine learning classifiers and less reliable evaluation results. In this paper we present a dataset balancing algorithm that minimizes the sample selection bias by eliminating irrelevant systematic differences between the sentiment classes. We prove its superiority over the random sampling method and we use it to create the Serbian movie review dataset – SerbMR – the first balanced and topically uniform sentiment analysis dataset in Serbian. In addition, we propose an incremental way of finding the optimal combination of simple text processing options and machine learning features for sentiment classification. Several popular classifiers are used in conjunction with this evaluation approach in order to establish strong but reliable baselines for sentiment analysis in Serbian.

Vuk Batanović, Boško Nikolić, Milan Milosavljević

In LREC, 2016.

Details PDF Dataset Stemmers for Serbian and Croatian NBSVM implementation for Weka

Publication List

Monolingual, multilingual and cross-lingual code comment classification

Details PDF Code Dataset ReLDI tokenizer for Serbian Stemmers for Serbian and Croatian Lemmatizer for Serbian FastText word embeddings for Serbian (Serbian web corpus srWaC) FastText word embeddings for Serbian (Common Crawl) FastText word embeddings for English BERTić LLM for Serbian ELECTRA LLM for English Multilingual BERT LLM XLM-RoBERTa LLM

A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources

Details PDF Official Repository STS.news.sr corpus SentiComments.SR dataset Stemmers for Serbian and Croatian STSFineGrain package STSAnno tool

A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts

Details PDF Code Dataset Serbian web corpus srWaC ReLDI tokenizer for Serbian Stemmers for Serbian and Croatian BTagger for Serbian HunPos and CST models for Croatian ReLDI tagger and lemmatizer for Serbian and Croatian

Open Resources and Technologies for Serbian Language Processing

Details PDF Slides Video SETimes.SR corpus ReLDI-NormTagNER-sr corpus STS.news.sr corpus paraphrase.sr corpus Serbian Movie Review (SerbMR) corpus SentiComments.SR corpus Web corpus srWaC Diacritic restoration tool Stemmers for Serbian and Croatian CLASSLA package STSFineGrain package ReLDIanno web service

Using Language Technologies to Automate the UNDP Rapid Integrated Assessment Mechanism in Serbian

Details PDF Code Dataset Transliterator for the Serbian Cyrillic/Latin script Stemmers for Serbian and Croatian

The "ReLDI effect": Collaborative development of manually annotated datasets for Slovene, Croatian and Serbian

Details PDF

SETimes.SR – A Reference Training Corpus of Serbian

Details PDF Slides Dataset CLARIN repository NoSketch Engine interface KonText interface

hr500k – A Reference Training Corpus of Croatian

Details PDF Slides Dataset CLARIN repository NoSketch Engine interface KonText interface

Fine-grained Semantic Textual Similarity for Serbian

Details PDF Code Dataset STSAnno annotation tool STS annotation guidelines Serbian web corpus srWaC ReLDI tokenizer for Serbian Stemmers for Serbian and Croatian BTagger for Serbian HunPos and CST models for Croatian ReLDI tagger and lemmatizer for Serbian and Croatian

Sentiment Classification of Documents in Serbian: The Effects of Morphological Normalization and Word Embeddings

Details PDF Dataset Serbian web corpus srWaC ReLDI tokenizer for Serbian Stemmers for Serbian and Croatian BTagger for Serbian HunPos and CST models for Croatian ReLDI tagger and lemmatizer for Serbian and Croatian NBSVM implementation for Weka

Sentiment Classification of Documents in Serbian: The Effects of Morphological Normalization

Details PDF Dataset ReLDI tokenizer for Serbian Stemmers for Serbian and Croatian BTagger for Serbian HunPos and CST models for Croatian ReLDI tagger and lemmatizer for Serbian and Croatian NBSVM implementation for Weka

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset

Details PDF Dataset Stemmers for Serbian and Croatian NBSVM implementation for Weka

Using Part-of-Speech Tags as Deep-Syntax Indicators in Determining Short-Text Semantic Similarity

Details PDF Dataset

Evaluation and Classification of Syntax Usage in Determining Short-Text Semantic Similarity

Details PDF Dataset

Evaluacija i klasifikacija korišćenja sintaksnih informacija u određivanju semantičke sličnosti kratkih tekstova

Details PDF Dataset

Semantic similarity of short texts in languages with a deficient natural language processing support

Details PDF Code Dataset

Softverski sistem za određivanje semantičke sličnosti kratkih tekstova na srpskom jeziku

Details PDF Dataset

Softverski sistem za učenje ekspertskih sistema

Details PDF Code

Created Datasets and Tools

SentiComments.SR - A Sentiment Analysis Dataset of Comments in Serbian

The SentiComments.SR dataset includes the following three corpora of short texts annotated for the task of sentiment analysis:
The main SentiComments.SR corpus, consisting of 3490 movie-related comments;
The movie verification corpus, consisting of 464 movie-related comments;
The book verification corpus, consisting of 173 book-related comments.
Six sentiment labels were used in dataset annotation: +1, -1, +M, -M, +NS, and -NS, with the addition of an ‘s’ label suffix denoting the presence of sarcasm. The main corpus was annotated by two annotators working together, and therefore contains a single, unified sentiment label for each comment. The verification corpora were used to evaluate the quality, efficiency, and cost-effectiveness of the annotation framework, which is why they contain separate sentiment labels from each of six annotators. The construction of this dataset is described in the 2020 PLoS ONE paper.
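
The label scheme above can be handled programmatically with a small parser. The following is only a sketch: it assumes that the sarcasm suffix is a lowercase ‘s’ appended directly to one of the six base labels (e.g. "+1s"), and the names `SentimentLabel` and `parse_label` are hypothetical helpers, not part of the dataset distribution. The base label parts are kept as opaque tokens.

```python
import re
from typing import NamedTuple

# Six base labels (+1, -1, +M, -M, +NS, -NS), optionally followed by 's'.
# Assumption: the sarcasm suffix is appended directly to the base label.
LABEL_PATTERN = re.compile(r"([+-])(1|M|NS)(s?)")


class SentimentLabel(NamedTuple):
    sign: str        # "+" or "-"
    kind: str        # "1", "M" or "NS", kept as an opaque token
    sarcastic: bool  # True if the 's' suffix is present


def parse_label(raw: str) -> SentimentLabel:
    """Split a raw label string into sign, kind, and sarcasm flag."""
    match = LABEL_PATTERN.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"Unrecognized sentiment label: {raw!r}")
    sign, kind, suffix = match.groups()
    return SentimentLabel(sign, kind, suffix == "s")
```

For example, `parse_label("-NSs")` yields the sign "-", the kind "NS", and a sarcasm flag set to true, while a string outside the scheme raises a `ValueError`.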

Serbian AutoRIA - a model for automating the RIA mechanism for Serbian

Rapid Integrated Assessment (RIA) is a national policy document evaluation mechanism developed by the UNDP to help countries assess their readiness for the implementation of UN Sustainable Development Goals (SDG). The created model automates the RIA procedure for documents written in Serbian and is based on an earlier IBM approach developed for English. The model works by searching the documents for sentences / paragraphs that are a semantic match for one of the SDG targets. The model repository also contains the Serbian national policy documents, as well as their stemmed versions. Further information can be found in the LT4All paper.

SETimes.SR reference training corpus of Serbian

The SETimes.SR reference training corpus of Serbian consists of 87 thousand tokens, or close to four thousand sentences, in Serbian, gathered from the (now defunct) Southeast European Times news portal. Each news story is treated as a separate document and is segmented into sentences and tokens. The entire corpus is annotated on the level of lemmas and parts of speech, morphosyntax, syntactic dependencies, and named entities. The construction of this corpus is described in a JT-DH 2018 paper.

STSFineGrain – a collection of semantic textual similarity models

STSFineGrain is a Java package that contains a collection of semantic textual similarity models and a framework for their evaluation on STS corpora with fine-grained similarity scores. Seven different STS models are implemented, including three unsupervised and four supervised models. The supervised models include previously presented algorithms, such as LInSTSS and POST STSS, as well as the new POS-TF STSS model, which outperforms them. Evaluation can be performed either on an entire dataset, or via cross-validation on it. STSFineGrain currently supports the POST STSS and POS-TF STSS models only for texts in Serbian and in English. The other models have no such language-related restrictions. This package was presented in the LREC 2018 paper.

The Serbian STS News Corpus (STS.news.sr)

The Serbian Semantic Textual Similarity News Corpus – STS.news.sr (ISLRN 146-979-597-345-4) consists of 1192 pairs of sentences in Serbian gathered from news sources on the web. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0–5 scale. The final scores were obtained by averaging the individual scores of five annotators. The construction of this corpus is described in the LREC 2018 paper.
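
The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the code used to build the corpus: the function name `final_similarity_score` is hypothetical, and the checks simply encode the five-annotator setup and the 0–5 scale mentioned in the description.

```python
from statistics import fmean


def final_similarity_score(annotator_scores, expected_annotators=5):
    """Average the individual annotator scores for one sentence pair."""
    scores = list(annotator_scores)
    if len(scores) != expected_annotators:
        raise ValueError(
            f"Expected {expected_annotators} scores, got {len(scores)}"
        )
    if any(not 0 <= score <= 5 for score in scores):
        raise ValueError("Scores must lie on the 0-5 scale")
    # The final score is the arithmetic mean of the individual scores.
    return fmean(scores)
```

Averaging is what makes the final scores fine-grained: five individual judgments on the same pair can produce a non-integer value even if each annotator assigns a whole number.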

STSAnno – a tool for semantic textual similarity annotation

STSAnno is a tool written in Java for offline semantic textual similarity (STS) annotation. It allows the user/annotator to assign and change semantic similarity scores of text/sentence pairs in a given corpus. This tool was presented in the LREC 2018 paper.

The Serbian Movie Review Dataset (SerbMR)

The Serbian Movie Review Dataset (SerbMR) collection consists of three movie review datasets in Serbian which were constructed for the task of sentiment analysis:
Collected movie reviews in Serbian (ISLRN 252-457-966-231-5) – an unbalanced collection of 4725 movie reviews in Serbian.
SerbMR-2C – The Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1) – a two-class balanced sentiment analysis dataset containing 1682 movie reviews in Serbian (841 positive and 841 negative reviews).
SerbMR-3C – The Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0) – a three-class balanced sentiment analysis dataset containing 2523 movie reviews in Serbian (841 positive, 841 neutral, and 841 negative reviews).
The construction of this dataset collection is described in the LREC 2016 paper.

SCStemmers – A collection of stemmers for Serbian and Croatian

SCStemmers is a package containing four stemming algorithms for Serbian and Croatian:
– The greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka,
– A refinement of their greedy stemmer for Serbian, by Nikola Milošević,
– A stemmer for Croatian, by Nikola Ljubešić and Ivan Pandžić.
SCStemmers can be used as a standalone tool or as a plug-in for Weka. The package was presented in the LREC 2016 paper.

Part-of-speech tag-supported short-text semantic similarity (POST STSS)

POST STSS is a method of computing short-text semantic similarity (i.e. semantic textual similarity) that uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to their parts of speech. The optimal POS weights are determined using an incremental, hill climbing-based technique. The only language-specific resource POST STSS requires is a part-of-speech tagger (and optionally a lemmatizer), making it applicable to most languages. Further information about the algorithm can be found in the 2015 ComSIS paper. POST STSS is implemented within the STSFineGrain package.

Language-independent Short-Text Semantic Similarity (LInSTSS)

LInSTSS is a method of computing short-text semantic similarity (i.e. semantic textual similarity) that uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to word frequencies. Since it does not use any language-specific tools or resources, LInSTSS is easily applicable to any language. Further information about the algorithm can be found in the 2013 Decision Support Systems paper. LInSTSS is implemented within the STSFineGrain package.
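
To give a rough feel for what a frequency-weighted bag-of-words similarity looks like, here is a toy sketch. It is not the LInSTSS algorithm: it uses only a string overlap measure (Python's `difflib.SequenceMatcher`) and leaves out the distributional semantics component entirely. The logarithmic weighting formula and the names `word_weight` and `text_similarity` are assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher


def word_weight(word, corpus_counts, total):
    # Assumed weighting: rarer words receive larger (always positive) weights.
    return math.log(1 + total / (corpus_counts.get(word, 0) + 1))


def text_similarity(text_a, text_b, corpus_counts):
    """Toy frequency-weighted bag-of-words similarity in the range [0, 1]."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    total = sum(corpus_counts.values()) or 1

    def directed(source, target):
        weighted_sum = 0.0
        weight_sum = 0.0
        for word in source:
            weight = word_weight(word, corpus_counts, total)
            # Best string-overlap match for this word in the other text.
            best = max(
                SequenceMatcher(None, word, other).ratio() for other in target
            )
            weighted_sum += weight * best
            weight_sum += weight
        return weighted_sum / weight_sum

    # Average both directions so that the measure is symmetric.
    return (directed(words_a, words_b) + directed(words_b, words_a)) / 2
```

Here `corpus_counts` is a plain dictionary of word frequencies from some reference corpus. Because only whitespace tokenization and frequency counts are involved, nothing in the sketch depends on a particular language.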

The Serbian Paraphrase Corpus (paraphrase.sr)

The Serbian Paraphrase Corpus – paraphrase.sr (ISLRN 192-200-046-033-9) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The construction of this corpus is described in the 2011 TELFOR paper and the 2013 Decision Support Systems paper.

Research Projects

COMtext.SR

The COMtext.SR project aims to develop a basic set of tools and resources for Serbian NLP and to publish them under a permissive license. The main project goal is to cover text domains that have not been previously considered either in academic or commercial projects, such as legal/administrative, financial, medical, etc. So far, the project has focused on the legal/administrative domain, as the most widely applicable one. Fine-tuned large language models and hand-annotated corpora for both Ekavian and Ijekavian pronunciations have been developed, tackling the tasks of morphosyntactic tagging, lemmatization, and named entity recognition. These new resources make it possible to directly address many important NLP use cases, such as advanced textual search, automated document indexing, anonymization of sensitive information (e.g., personally identifiable information), etc.

(Re-)imagining language, nation and collective identity in the 21st century: Language ideologies as new connections in post-Yugoslav digital mediascapes

This project of the Research Centre of the Slovenian Academy of Sciences and Arts investigates the links in conceptions of language and nation in the post-Yugoslav space, spanning six states (Slovenia, Croatia, Serbia, Bosnia & Herzegovina, Montenegro, Macedonia), and zooming in on news media texts and connected social-media citizen discourses. On this project, I have coordinated the collection, curation and publishing of specialized corpora of news media texts and digital citizen comments focused on the topic of language. Such corpora have been constructed using a standardized methodology across multiple languages, including Serbian, Croatian, and Slovenian.

Advancing Novel Textual Similarity-based Solutions in Software Development (AVANTES)

Advancing Novel Textual Similarity-based Solutions in Software Development (AVANTES) is a two-year project supported by the Science Fund of the Republic of Serbia which aims to develop various natural language processing (NLP) tools and techniques for use in software development. The main research question the project tackles is the relationship between programming code semantics and the meaning of code comments written in a natural language. Within the scope of the project, several NLP tasks are to be considered, including code comment categorization according to a comment type taxonomy, comment pair similarity, calculated using cross-level semantic similarity methods, and semantic code searching. In addition, the project will focus on the identification of various types of code clones. All of these research goals are to be addressed across multiple programming (C/C++/C#, Java, JavaScript, PHP, Python, SQL) and natural languages (English and Serbian). On this project, I am tasked with overseeing and leading the development of NLP tools and annotated datasets for NLP problems.

Automating the Rapid Integrated Assessment mechanism in Serbian

Rapid Integrated Assessment (RIA) is a national policy document evaluation mechanism developed by the UNDP to help countries assess their readiness for the implementation of UN Sustainable Development Goals. The aim of this project was to automate the RIA procedure for documents written in Serbian, based on an earlier UNDP/IBM pilot project for English. The project was proposed by UN Country Team Serbia and funded via the 2018 call for innovation proposals by the UN Development Operations Coordination Office (UNDOCO). Implementation was performed in cooperation with the SeConS Development Initiative Group.

Regional Linguistic Data Initiative (ReLDI)

Regional Linguistic Data Initiative – ReLDI (SNSF SCOPES project 160501) was a two-year institutional partnership between research units in Switzerland, Serbia and Croatia. As a research collaborator, I participated in the creation, distribution and analysis of linguistic/NLP datasets and tools for Serbian and Croatian. ReLDI Centre Belgrade was founded after the conclusion of the project in order to continue the activities of this partnership.

Teaching

(with prof. dr Boško Nikolić)

School year 2019/2020 - present – Created the teaching materials, gave lectures and practical demonstrations, and supervised student projects within the new Natural Language Processing course at the Master 4.0: Advanced information technologies in the digital transformation master’s degree study program of the School of Electrical Engineering and the Faculty of Organisational Sciences of the University of Belgrade.
School year 2017/2018 - present – Created the teaching materials, gave lectures and practical demonstrations, and supervised student projects within the new Natural Language Processing course at the Software Engineering master’s degree study program of the School of Electrical Engineering, University of Belgrade.
School year 2017/2018 - present – Created a part of the teaching materials and gave a part of the lectures and practical demonstrations within the Data Mining course at the Computer Science and Information Technology master’s degree study program of the School of Electrical Engineering, University of Belgrade.
School year 2016/2017 – Created the teaching materials, gave lectures and practical demonstrations, and supervised student projects within the new Machine Learning course at the Intelligent Systems PhD study program of the University of Belgrade.
School year 2015/2016 - present – Supervised several bachelor’s degree and master’s degree theses in NLP/ML by students at the School of Electrical Engineering, University of Belgrade.

Skills

Programming Languages

Python
Java
C++
C#
C
Matlab
SQL

ML/NLP Tools and Frameworks

HuggingFace Transformers and Simple Transformers
Scikit-learn
SciPy stack
- SciPy,
- NumPy,
- pandas,
- IPython/Jupyter,
- matplotlib,…
gensim
fastText
Natural Language Toolkit
CoreNLP
LIBSVM/LIBLINEAR

Annotation Tools

brat
WebAnno

Language Proficiency

Serbian (native)
English (fluent – C2, Cambridge Certificate of Proficiency in English (CPE), grade A)
French (limited)

Other Information

Awards, Grants and Scholarships

Grants and donations by multiple Serbian foundations and IT companies dedicated to the COMtext.SR project
Science Fund of the Republic of Serbia Grant 6526093 for the AVANTES (Advancing Novel Textual Similarity-based Solutions in Software Development) project
Jožef Stefan Institute/CLARIN project grant for the development of coreference annotation in Serbian and Croatian corpora
TELFOR 2016 Blažo Mirčevski award for the best paper by a young author
Jožef Stefan Institute/CLARIN project grant for the consolidation and enlargement of language resources in Croatian and Serbian
ReLDI project grant for the creation of language resources in Serbian and Croatian
2010 Scholarship of the Fund for Young Talents of the Republic of Serbia

Membership in Professional Organizations

ACL SIGSLAV – Association for Computational Linguistics Special Interest Group on Slavic Natural Language Processing

Peer-review Activities

Contact

vuk.batanovic@ic.etf.bg.ac.rs
Bulevar kralja Aleksandra 73, 11120 Belgrade, Serbia
Send a message or an e-mail for appointments