Publications | Vuk Batanović

Monolingual, multilingual and cross-lingual code comment classification

Code comments are one of the most useful forms of documentation and metadata for understanding software implementation. Previous research on code comment classification has focused only on comments in English, typically extracted from a few programming languages. This paper addresses the problem of code comment classification not only in the monolingual setting, but also in the multilingual and cross-lingual settings, in order to examine whether these can outperform the traditional monolingual approach. To tackle this task, we introduce a novel, publicly available code comment dataset, consisting of over 10,000 code comments collected from software projects written in eight programming languages (C, C++, C#, Java, JavaScript/TypeScript, PHP, Python, and SQL). About half of them are written in Serbian while the other half are written in English. This dataset was manually annotated according to a newly proposed taxonomy of code comment categories. We fine-tune and evaluate multiple monolingual and multilingual pre-trained neural language models on the code comment classification task and compare their performance to several baselines. The best results for Serbian comments are obtained using the monolingual neural model BERTić, trained on Serbian and closely related languages. On the other hand, the optimal choice for English is the multilingual neural model multilingual BERT, which successfully extracts useful patterns from data in both languages. Although the cross-lingual setting shows some promise for simple binary classification, it has yet to reach sufficiently high performance levels for practical use. We also analyze model performance across different programming languages.

Marija Kostić, Vuk Batanović, Boško Nikolić

In EAAI, 2023.

Details | PDF | Code | Dataset | ReLDI tokenizer for Serbian | Stemmers for Serbian and Croatian | Lemmatizer for Serbian | FastText word embeddings for Serbian (Serbian web corpus srWaC) | FastText word embeddings for Serbian (Common Crawl) | FastText word embeddings for English | BERTić LLM for Serbian | ELECTRA LLM for English | Multilingual BERT LLM | XLM-RoBERTa LLM

A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources

Statistical approaches to natural language processing typically require considerable amounts of labeled data, and often various auxiliary language tools as well, limiting their applicability in resource-limited settings. This thesis presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources. In these languages, not only are existing language resources limited, but so are the capabilities for developing new datasets and dedicated tools and algorithms. The proposed methodology focuses on short texts due to their prevalence in digital communication, as well as the greater complexity regarding their semantic processing.
The methodology encompasses all phases in the creation of statistical solutions, from the collection of textual content, to data annotation, to the formulation, training, and evaluation of machine learning models. Its use is illustrated in detail on two semantic tasks – sentiment analysis and semantic textual similarity. The Serbian language is utilized as an example of a language with limited resources, but the proposed methodology can also be applied to other languages in this category.
In addition to the general methodology, the contributions of this thesis consist of the development of a new, flexible short-text sentiment annotation system, a new annotation cost-effectiveness metric, as well as several new semantic textual similarity models. The thesis results also include the creation of the first publicly available annotated datasets of short texts in Serbian for the tasks of sentiment analysis and semantic textual similarity, the development and evaluation of numerous models on these tasks, and the first comparative evaluation of multiple morphological normalization tools on short texts in Serbian.

Vuk Batanović

PhD Thesis, University of Belgrade - School of Electrical Engineering, 2020.

Details | PDF | Official Repository | STS.news.sr corpus | SentiComments.SR dataset | Stemmers for Serbian and Croatian | STSFineGrain package | STSAnno tool

A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts

Choosing a comprehensive and cost-effective way of articulating and annotating the sentiment of a text is not a trivial task, particularly when dealing with short texts, in which sentiment can be expressed through a wide variety of linguistic and rhetorical phenomena. This problem is especially conspicuous in resource-limited settings and languages, where design options are restricted either in terms of manpower and financial means required to produce appropriate sentiment analysis resources, or in terms of available language tools, or both. In this paper, we present a versatile approach to addressing this issue, based on multiple interpretations of sentiment labels that encode information regarding the polarity, subjectivity, and ambiguity of a text, as well as the presence of sarcasm or a mixture of sentiments. We demonstrate its use on Serbian, a resource-limited language, via the creation of a main sentiment analysis dataset focused on movie comments, and two smaller datasets belonging to the movie and book domains. In addition to measuring the quality of the annotation process, we propose a novel metric to validate its cost-effectiveness. Finally, the practicality of our approach is further validated by training, evaluating, and determining the optimal configurations of several different kinds of machine-learning models on a range of sentiment classification tasks using the produced dataset.

Vuk Batanović, Miloš Cvetanović, Boško Nikolić

In PLoS ONE, 2020.

Details | PDF | Code | Dataset | Serbian web corpus srWaC | ReLDI tokenizer for Serbian | Stemmers for Serbian and Croatian | BTagger for Serbian | HunPos and CST models for Croatian | ReLDI tagger and lemmatizer for Serbian and Croatian

Open Resources and Technologies for Serbian Language Processing

The openness of language resources and tools is of great importance for increasing the quality and speed of development of technologies for the computational processing of natural languages. This paper presents open resources for processing the Serbian language. It describes manually annotated corpora, as well as a broader range of tools and computational models, including a web service that makes them easy to use.

Vuk Batanović, Nikola Ljubešić, Tanja Samardžić, Maja Miličević Petrović

In PSSOH 2020, 2020.

Details | PDF | Slides | Video | SETimes.SR corpus | ReLDI-NormTagNER-sr corpus | STS.news.sr corpus | paraphrase.sr corpus | Serbian Movie Review (SerbMR) corpus | SentiComments.SR corpus | Web corpus srWaC | Diacritic restoration tool | Stemmers for Serbian and Croatian | CLASSLA package | STSFineGrain package | ReLDIanno web service

Using Language Technologies to Automate the UNDP Rapid Integrated Assessment Mechanism in Serbian

Rapid Integrated Assessment (RIA) is a United Nations Development Programme procedure involving a comparison between a country’s development policy documents and the UN-defined Sustainable Development Goals. In this paper, we present the Serbian AutoRIA system that automates this procedure in Serbian, a resource-limited yet morphologically rich language. We discuss the issues regarding the preprocessing of data for this task, and the general architecture and language-related specificities of the system. We also evaluate the performance effects of various system settings using the results of a previous, manually completed RIA procedure for Serbia.

Vuk Batanović, Boško Nikolić

In LT4All, 2019.

Details | PDF | Code | Dataset | Transliterator for the Serbian Cyrillic/Latin script | Stemmers for Serbian and Croatian

The "ReLDI effect": Collaborative development of manually annotated datasets for Slovene, Croatian and Serbian

With the rapid development and increasing accessibility of natural language processing (NLP) techniques, the exploitation of NLP inside electronic lexicography is on the rise. Textual datasets manually annotated with linguistic information are the backbone of the currently dominant paradigm in NLP, which is based on supervised machine learning. However, developing such manually annotated datasets is a very costly activity, which is one of the reasons for the limited availability of NLP technologies for languages with fewer speakers, and especially for less dominant language varieties such as the language of the Internet.
In this talk we present a series of collaborations between researchers developing such datasets for Slovene, Croatian and Serbian, three languages with just a few million speakers each. The close relatedness of these languages brings an opportunity for a synchronized approach to the development of resources and technologies, to the benefit of all parties. Due to the complex political environment, however, such an approach was not established until the start of the ReLDI (Regional Linguistic Data Initiative) project. The main synergistic effect of the collaborations presented here is achieved by drastically lowering the effort required to produce datasets in additional languages, primarily in the areas of (1) the development of annotation guidelines, (2) setting up the technical requirements for the annotation campaigns and (3) pre-annotation of data with models trained for another, closely related language.
The linguistic levels covered in the resulting datasets are those of tokenisation, sentence segmentation, normalisation, morphosyntax, lemmatisation, dependency parsing, semantic role labeling, named entity recognition and coreference resolution. Two varieties of each of the three languages are covered: the standard variety and the variety of the language of the Internet.

Nikola Ljubešić, Tanja Samardžić, Tomaž Erjavec, Darja Fišer, Maja Miličević Petrović, Simon Krek, Vuk Batanović

In eReL, 2019.

Details | PDF

SETimes.SR – A Reference Training Corpus of Serbian

In this paper we present SETimes.SR – a gold standard dataset for Serbian, annotated with regard to document, sentence, and token segmentation, morphosyntax, lemmas, dependency syntax, and named entities. We describe the annotation layers and provide a basic statistical overview of them, and we discuss the method of encoding them in the CoNLL and the TEI format. In addition, we compare the SETimes.SR corpus with the older SETimes.HR dataset in Croatian.
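As a rough illustration of what a token-level CoNLL encoding looks like, the sketch below reads one sentence in the ten-column CoNLL-U layout. This is an assumption made for illustration: the abstract only says "the CoNLL and the TEI format", so the exact column set used in the corpus may differ. The sample sentence and its annotations are made up for this example and are not taken from SETimes.SR.

```python
# Column names of the CoNLL-U layout (assumed here; the corpus may use a variant).
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_sentence(block):
    """Parse one sentence block into a list of token dictionaries."""
    tokens = []
    for line in block.strip().splitlines():
        # Skip blank lines and sentence-level comment lines such as "# text = ...".
        if not line or line.startswith("#"):
            continue
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
        tokens.append(dict(zip(CONLLU_FIELDS, values)))
    return tokens


# Made-up example sentence ("Ana is reading a book."), not corpus data.
SAMPLE = "\n".join([
    "# text = Ana čita knjigu.",
    "\t".join(["1", "Ana", "Ana", "PROPN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "čita", "čitati", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "knjigu", "knjiga", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

sentence = parse_sentence(SAMPLE)
lemmas = [token["lemma"] for token in sentence]
root = next(token["form"] for token in sentence if token["head"] == "0")
```

Each token line carries the segmentation, lemma, morphosyntactic tag, and dependency annotations in fixed columns, which is why several annotation layers can share one file.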

Vuk Batanović, Nikola Ljubešić, Tanja Samardžić

In JT-DH, 2018.

Details | PDF | Slides | Dataset | CLARIN repository | NoSketch Engine interface | KonText interface

hr500k – A Reference Training Corpus of Croatian

In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also give a description of the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.

Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec

In JT-DH, 2018.

Details | PDF | Slides | Dataset | CLARIN repository | NoSketch Engine interface | KonText interface

Fine-grained Semantic Textual Similarity for Serbian

Although the task of semantic textual similarity (STS) has gained in prominence in the last few years, annotated STS datasets for model training and evaluation, particularly those with fine-grained similarity scores, remain scarce for languages other than English, and practically non-existent for minor ones. In this paper, we present the Serbian Semantic Textual Similarity News Corpus (STS.news.sr) – an STS dataset for Serbian that contains 1192 sentence pairs annotated with fine-grained semantic similarity scores. We describe the process of its creation and annotation, and we analyze and compare our corpus with the existing news-based STS datasets in English and other major languages. Several existing STS models are evaluated on the Serbian STS News Corpus, and a new supervised bag-of-words model that combines part-of-speech weighting with term frequency weighting is proposed and shown to outperform similar methods. Since Serbian is a morphologically rich language, the effect of various morphological normalization tools on STS model performances is considered as well. The Serbian STS News Corpus, the annotation tool and guidelines used in its creation, and the STS model framework used in the evaluation are all made publicly available.

Vuk Batanović, Miloš Cvetanović, Boško Nikolić

In LREC, 2018.

Details | PDF | Code | Dataset | STSAnno annotation tool | STS annotation guidelines | Serbian web corpus srWaC | ReLDI tokenizer for Serbian | Stemmers for Serbian and Croatian | BTagger for Serbian | HunPos and CST models for Croatian | ReLDI tagger and lemmatizer for Serbian and Croatian

Sentiment Classification of Documents in Serbian: The Effects of Morphological Normalization and Word Embeddings

An open issue in the sentiment classification of texts written in Serbian is the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled texts. In this paper, we assess the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset. We also consider the effectiveness of using word embeddings, generated from a large unlabeled corpus, as classification features.

Vuk Batanović, Boško Nikolić

In Telfor Journal, 2017.

Details | PDF | Dataset | Serbian web corpus srWaC | ReLDI tokenizer for Serbian | Stemmers for Serbian and Croatian | BTagger for Serbian | HunPos and CST models for Croatian | ReLDI tagger and lemmatizer for Serbian and Croatian | NBSVM implementation for Weka

Sentiment Classification of Documents in Serbian: The Effects of Morphological Normalization

Sentiment classification of texts written in Serbian is still an under-researched topic. One of the open issues is how the different forms of morphological normalization affect the performances of different sentiment classifiers and which normalization procedure is optimal for this task. In this paper we assess and compare the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset.

Vuk Batanović, Boško Nikolić
Blažo Mirčevski award for the best paper by a young author

In TELFOR, 2016.

Details | PDF | Dataset | ReLDI tokenizer for Serbian | Stemmers for Serbian and Croatian | BTagger for Serbian | HunPos and CST models for Croatian | ReLDI tagger and lemmatizer for Serbian and Croatian | NBSVM implementation for Weka

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset

Collecting data for sentiment analysis in resource-limited languages carries a significant risk of sample selection bias, since the small quantities of available data are most likely not representative of the whole population. Ignoring this bias leads to less robust machine learning classifiers and less reliable evaluation results. In this paper we present a dataset balancing algorithm that minimizes the sample selection bias by eliminating irrelevant systematic differences between the sentiment classes. We prove its superiority over the random sampling method and we use it to create the Serbian movie review dataset – SerbMR – the first balanced and topically uniform sentiment analysis dataset in Serbian. In addition, we propose an incremental way of finding the optimal combination of simple text processing options and machine learning features for sentiment classification. Several popular classifiers are used in conjunction with this evaluation approach in order to establish strong but reliable baselines for sentiment analysis in Serbian.

Vuk Batanović, Boško Nikolić, Milan Milosavljević

In LREC, 2016.

Details | PDF | Dataset | Stemmers for Serbian and Croatian | NBSVM implementation for Weka

Using Part-of-Speech Tags as Deep-Syntax Indicators in Determining Short-Text Semantic Similarity

This paper presents POST STSS, a method of determining short-text semantic similarity in which part-of-speech tags are used as indicators of the deeper syntactic information usually extracted by more advanced tools like parsers and semantic role labelers. Our model employs a part-of-speech weighting scheme and is based on a statistical bag-of-words approach. It does not require either hand-crafted knowledge bases or advanced syntactic tools, which makes it easily applicable to languages with limited natural language processing resources. By using a paraphrase recognition test, we demonstrate that our system achieves a higher accuracy than all existing statistical similarity algorithms and solutions of a more structural kind.
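The general idea of a part-of-speech-weighted bag-of-words comparison can be sketched as follows. This is a minimal illustration and not the POST STSS algorithm itself: the weight values, the tag names, and the use of cosine similarity are all assumptions chosen for the example, and the input is assumed to be already tokenized and tagged.

```python
import math
from collections import Counter

# Hypothetical POS weights, chosen only for illustration (not values from the paper).
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.8, "ADJ": 0.6, "ADV": 0.4}
DEFAULT_WEIGHT = 0.1  # function words and any tag not listed above


def weighted_vector(tagged_tokens):
    """Build a bag-of-words vector in which each occurrence counts by its POS weight."""
    vector = Counter()
    for word, pos in tagged_tokens:
        vector[word.lower()] += POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
    return vector


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(tagged_a, tagged_b):
    """Similarity of two short texts given as lists of (word, POS tag) pairs."""
    return cosine_similarity(weighted_vector(tagged_a), weighted_vector(tagged_b))


text_a = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
text_b = [("A", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), ("soundly", "ADV")]
score = similarity(text_a, text_b)
```

Because nouns and verbs carry most of the weight here, the differing determiners barely affect the score. The only language-specific inputs are a tokenizer and a POS tagger, which is the kind of low resource requirement the abstract emphasizes.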

Vuk Batanović, Dragan Bojić

In ComSIS, 2015.

Details | PDF | Dataset

Evaluation and Classification of Syntax Usage in Determining Short-Text Semantic Similarity

This paper outlines and categorizes ways of using syntactic information in a number of algorithms for determining the semantic similarity of short texts. We consider the use of word order information, part-of-speech tagging, parsing and semantic role labeling. We analyze and evaluate the effects of syntax usage on algorithm performance by utilizing the results of a paraphrase detection test on the Microsoft Research Paraphrase Corpus. We also propose a new classification of algorithms based on their applicability to languages with scarce natural language processing tools.

Vuk Batanović, Dragan Bojić

In Telfor Journal, 2014.

Details | PDF | Dataset

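Evaluacija i klasifikacija korišćenja sintaksnih informacija u određivanju semantičke sličnosti kratkih tekstova (Evaluation and classification of the use of syntactic information in determining the semantic similarity of short texts)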

This paper outlines and categorizes ways of using syntax information in a number of algorithms for determining short text semantic similarity. Algorithm performance was evaluated using the results of a paraphrase detection test on the Microsoft Research Paraphrase Corpus. Among the described algorithms and approaches to using syntax information we identify those best suited for application in languages with limited electronic linguistic tools and, with that goal in mind, we propose a new algorithm classification.

Vuk Batanović, Dragan Bojić
Paper in Serbian

In TELFOR, 2013.

Details | PDF | Dataset

Semantic similarity of short texts in languages with a deficient natural language processing support

Measuring the semantic similarity of short texts is a noteworthy problem since short texts are widely used on the Internet, in the form of product descriptions or captions, image and webpage tags, news headlines, etc. This paper describes a methodology which can be used to create a software system capable of determining the semantic similarity of two given short texts. The proposed LInSTSS approach is particularly suitable for application in situations when no large, publicly available, electronic linguistic resources can be found for the desired language. We describe the basic working principles of the system architecture we propose, as well as the stages of its construction and use. Also, we explain the procedure used to generate a paraphrase corpus which is then utilized in the evaluation process. Finally, we analyze the evaluation results obtained from a system created for the Serbian language, and we discuss possible improvements which would increase system accuracy.

Bojan Furlan, Vuk Batanović, Boško Nikolić

In Decision Support Systems, 2013.

Details | PDF | Code | Dataset

Softverski sistem za određivanje semantičke sličnosti kratkih tekstova na srpskom jeziku (A software system for determining the semantic similarity of short texts in Serbian)

This paper describes a software system for determining the degree of semantic similarity of two short texts written in Serbian. Its basic working principles are presented, as well as the phases of its functioning and evaluation. It also describes the process of generating a paraphrase corpus, which was used for evaluation purposes. Finally, evaluation results are discussed and further improvements of the system’s precision are considered.

Vuk Batanović, Bojan Furlan, Boško Nikolić
Paper in Serbian

In TELFOR, 2011.

Details | PDF | Dataset

Softverski sistem za učenje ekspertskih sistema (A software system for learning expert systems)

This paper describes a software system for learning about expert systems, developed at the Faculty of Electrical Engineering in Belgrade. The software was designed as an educational tool for students and will be used in teaching the Expert Systems course in undergraduate and Master's studies. The system guides students through examples and step-by-step tasks, grouped by topic, that are covered in lectures and tutorials. The following topics are covered: breadth-first and depth-first strategies, the hill-climbing method, the best-first method, branch and bound methods, the A* method, search in games, production systems, the general problem solver, the STRIPS planning system, and systems for operating in uncertain environments: fuzzy logic, probability factor reasoning, and other ways of expressing uncertainty. Students may enter their own examples and tasks and obtain correct solutions for them. At any point, it is possible to go one step back or forward. At the end, a student may also print the detailed step-by-step procedure for solving the task. The implemented system improves lecturers' efficiency and enhances knowledge acquisition in innovative curricula.
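To make the first two strategies on that list concrete, here is a small sketch of breadth-first and depth-first search over a toy graph. It is not code from the described system: the graph, the function name `search`, and the choice to record the order in which nodes are expanded (loosely mirroring step-by-step tracing) are all assumptions made for illustration.

```python
from collections import deque

# Toy directed graph, invented for this example.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}


def search(graph, start, goal, breadth_first=True):
    """Return (path, expanded); path is None if the goal cannot be reached.

    The frontier holds partial paths. Taking paths from the left end gives
    breadth-first search (queue); taking them from the right end gives
    depth-first search (stack).
    """
    frontier = deque([[start]])
    visited = set()
    expanded = []  # order in which nodes were expanded, one entry per step
    while frontier:
        path = frontier.popleft() if breadth_first else frontier.pop()
        node = path[-1]
        if node in visited:
            continue
        visited.add(node)
        expanded.append(node)
        if node == goal:
            return path, expanded
        for neighbor in graph[node]:
            if neighbor not in visited:
                frontier.append(path + [neighbor])
    return None, expanded


bfs_path, bfs_steps = search(GRAPH, "A", "F", breadth_first=True)
dfs_path, dfs_steps = search(GRAPH, "A", "F", breadth_first=False)
```

On this graph the breadth-first run expands every node level by level before reaching `F`, while the depth-first run follows a single branch (`A`, `C`, `E`, `F`) and expands fewer nodes. The only difference between the two strategies is which end of the frontier the next path is taken from.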

Dražen Drašković, Vuk Batanović, Boško Nikolić
Paper in Serbian

In TELFOR, 2010.

Details | PDF | Code