Scientific papers | Vuk Batanović

Monolingual, multilingual and cross-lingual code comment classification

Code comments are one of the most useful forms of documentation and metadata for understanding software implementation. Previous research on code comment classification has focused only on comments in English, typically extracted from a few programming languages. This paper addresses the problem of code comment classification not only in the monolingual setting, but also in the multilingual and cross-lingual one, in order to examine whether they can outperform the traditional monolingual approach. To tackle this task, we introduce a novel, publicly available code comment dataset, consisting of over 10,000 code comments collected from software projects written in eight programming languages (C, C++, C#, Java, JavaScript/TypeScript, PHP, Python, and SQL). About half of them are written in Serbian while the other half are written in English. This dataset was manually annotated according to a newly proposed taxonomy of code comment categories. We fine-tune and evaluate multiple monolingual and multilingual pre-trained neural language models on the code comment classification task and compare their performances to several baselines. The best results for Serbian comments are obtained using the monolingual neural model BERTić, trained on Serbian and closely related languages. On the other hand, the optimal choice for English is the multilingual neural model multilingual BERT, which successfully extracts useful patterns from data in both languages. Although the cross-lingual setting shows some promise for simple binary classification, it has yet to reach sufficiently high performance levels for practical use. We also analyze model performance across different programming languages.
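The three evaluation settings named in the abstract can be made concrete with a small sketch. The selection rules below are the usual reading of these terms (train on the target language, on both languages, or only on the other language) and are an assumption here, as are the record layout and the placeholder labels; none of this is taken from the paper's code.

```python
# Sketch of how training data could be chosen for each evaluation setting.
# The setting names follow the abstract; the selection rules and the
# record layout ("lang", "text", "label") are assumptions for illustration.

def training_pool(train_examples, setting, target_lang):
    """Return the training examples used when evaluating on target_lang."""
    if setting == "monolingual":
        # Train only on comments in the language being evaluated.
        return [ex for ex in train_examples if ex["lang"] == target_lang]
    if setting == "multilingual":
        # Train on comments in all available languages.
        return list(train_examples)
    if setting == "cross-lingual":
        # Train only on comments in the other language(s).
        return [ex for ex in train_examples if ex["lang"] != target_lang]
    raise ValueError(f"unknown setting: {setting!r}")

# Hypothetical toy data; the labels are placeholders, not the paper's taxonomy.
examples = [
    {"lang": "en", "text": "TODO: handle empty input", "label": "label_a"},
    {"lang": "en", "text": "Returns the user id", "label": "label_b"},
    {"lang": "sr", "text": "Vraća identifikator korisnika", "label": "label_b"},
]

for setting in ("monolingual", "multilingual", "cross-lingual"):
    pool = training_pool(examples, setting, target_lang="sr")
    print(setting, [ex["lang"] for ex in pool])
```

In all three cases the held-out test data would be in the target language only; what changes is which comments the model sees during training.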

Marija Kostić, Vuk Batanović, Boško Nikolić

In EAAI, 2023.

More information: PDF, Source code, Dataset, ReLDI tokenizer for Serbian, Stemmers for Serbian and Croatian, Lemmatizer for Serbian, FastText word embeddings for Serbian (Serbian web corpus srWaC), FastText word embeddings for Serbian (Common Crawl), FastText word embeddings for English, BERTić LLM for Serbian, ELECTRA LLM for English, Multilingual BERT LLM, XLM-RoBERTa LLM

A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources

Statistical approaches to natural language processing typically require considerable amounts of annotated data, and often various auxiliary language tools as well, which limits their applicability in resource-limited settings. This dissertation presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources. Such languages are characterized not only by a limited number of existing language resources, but also by limited opportunities for developing new datasets and dedicated tools and algorithms.
The proposed methodology focuses on short texts because of their prevalence in digital communication and the greater complexity of their semantic processing. The methodology covers all phases of building statistical solutions, from collecting textual content, through data annotation, to formulating, training, and evaluating machine learning models. Its use is illustrated in detail on two semantic tasks: sentiment analysis and semantic similarity determination. Serbian is used as an example of a resource-limited language, but the proposed methodology can also be applied to other languages in this category.
In addition to the general methodology, the contributions of this dissertation include the development of a new, flexible system for labeling the sentiment of short texts, a new metric for assessing the cost-effectiveness of annotation, and several new models for determining the semantic similarity of short texts. The results of the dissertation also include the creation of the first publicly available annotated datasets for sentiment analysis and semantic similarity of short texts in Serbian, the development and evaluation of a number of models on these tasks, and the first comparative evaluation of a number of morphological normalization tools on short texts in Serbian.

Vuk Batanović

Doctoral dissertation, University of Belgrade - School of Electrical Engineering, 2020.

More information: PDF, Official repository, STS.news.sr corpus, SentiComments.SR dataset, Stemmers for Serbian and Croatian, STSFineGrain package, STSAnno tool

A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts

Choosing a comprehensive and cost-effective way of articulating and annotating the sentiment of a text is not a trivial task, particularly when dealing with short texts, in which sentiment can be expressed through a wide variety of linguistic and rhetorical phenomena. This problem is especially conspicuous in resource-limited settings and languages, where design options are restricted either in terms of manpower and financial means required to produce appropriate sentiment analysis resources, or in terms of available language tools, or both. In this paper, we present a versatile approach to addressing this issue, based on multiple interpretations of sentiment labels that encode information regarding the polarity, subjectivity, and ambiguity of a text, as well as the presence of sarcasm or a mixture of sentiments. We demonstrate its use on Serbian, a resource-limited language, via the creation of a main sentiment analysis dataset focused on movie comments, and two smaller datasets belonging to the movie and book domains. In addition to measuring the quality of the annotation process, we propose a novel metric to validate its cost-effectiveness. Finally, the practicality of our approach is further validated by training, evaluating, and determining the optimal configurations of several different kinds of machine-learning models on a range of sentiment classification tasks using the produced dataset.

Vuk Batanović, Miloš Cvetanović, Boško Nikolić

In PLoS ONE, 2020.

More information: PDF, Source code, Dataset, Serbian web corpus srWaC, ReLDI tokenizer for Serbian, Stemmers for Serbian and Croatian, BTagger for Serbian, HunPos and CST models for Croatian, ReLDI tagger and lemmatizer for Serbian and Croatian

Open resources and technologies for processing the Serbian language

The openness of language resources and tools is of great importance for increasing the quality and speed of development of technologies for the computational processing of natural languages. This paper presents open resources for processing the Serbian language. It describes manually annotated corpora, as well as a wider range of tools and computational models, including a web service that makes them easy to use.

Vuk Batanović, Nikola Ljubešić, Tanja Samardžić, Maja Miličević Petrović

PSSOH 2020, 2020.

More information: PDF, Slides, Video, SETimes.SR corpus, ReLDI-NormTagNER-sr corpus, STS.news.sr corpus, paraphrase.sr corpus, Serbian Movie Review (SerbMR) corpus, SentiComments.SR corpus, srWaC web corpus, Rediacritization tool, Stemmers for Serbian and Croatian, CLASSLA package, STSFineGrain package, ReLDIanno web service

Using Language Technologies to Automate the UNDP Rapid Integrated Assessment Mechanism in Serbian

The Rapid Integrated Assessment (RIA) is a procedure of the United Nations Development Programme that involves comparing national strategic development documents with the Sustainable Development Goals defined by the United Nations. In this paper, we present the Serbian AutoRIA system, which automates this procedure for Serbian, a resource-limited language with rich morphology. We discuss the data preprocessing issues relevant to this task, as well as the general architecture and the language-specific aspects of the system. We also evaluate the effects of different system settings on its performance, using the results of an earlier, manually conducted RIA procedure for Serbia.
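Because Serbian is written in both Cyrillic and Latin script, one preprocessing step that a system of this kind may need is script unification. The sketch below converts Serbian Cyrillic to Latin using the standard letter correspondences. It is an assumption that such a step applies here, and the function is an illustration only, not the transliterator linked for this paper.

```python
# Minimal Serbian Cyrillic -> Latin transliteration sketch.
# The 30-letter mapping is the standard one; the function name and the
# simplified handling of capital digraphs are choices made for this example.

CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic characters, leaving all others as-is."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            latin = CYR_TO_LAT[lower]
            # Capital letters map to a capitalized form ("Љ" -> "Lj").
            # All-caps words would need extra handling ("ЉУБАВ" -> "LJUBAV").
            out.append(latin.capitalize() if ch != lower else latin)
        else:
            out.append(ch)
    return "".join(out)

print(cyrillic_to_latin("Србија"))
```

The reverse direction is harder, since Latin digraphs such as "nj" do not always correspond to a single Cyrillic letter, which is one reason a dedicated tool is preferable in practice.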

Vuk Batanović, Boško Nikolić

LT4All, 2019.

More information: PDF, Source code, Dataset, Transliterator for Serbian Cyrillic/Latin script, Stemmers for Serbian and Croatian

The "ReLDI effect": Collaborative development of manually annotated datasets for Slovene, Croatian and Serbian

With the rapid development and increasing accessibility of natural language processing (NLP) techniques, the exploitation of NLP inside electronic lexicography is on the rise. Textual datasets manually annotated with linguistic information are a backbone of the currently dominating paradigm in NLP based on supervised machine learning. However, developing such manually annotated datasets is a very costly activity, which is one of the reasons for limited availability of NLP technologies for languages with fewer speakers, and especially for less dominant language varieties such as the language of the Internet.
In this talk we present a series of collaborations between researchers developing such datasets for Slovene, Croatian and Serbian, three languages with just a few million speakers each. Close relatedness of these languages brings an opportunity for a synchronized approach to the development of resources and technologies, to the benefit of all parties. Due to the complex political environment, however, such an approach has not been established until the start of the ReLDI (Regional Linguistic Data Initiative) project. The main synergistic effect of the collaborations presented here is achieved by drastically lowering the efforts required to produce datasets in additional languages, primarily in the areas of (1) the development of annotation guidelines, (2) setting up the technical requirements for the annotation campaigns and (3) pre-annotation of data with models trained for another, but very close language.
The linguistic levels covered in the resulting datasets are those of tokenisation, sentence segmentation, normalisation, morphosyntax, lemmatisation, dependency parsing, semantic role labeling, named entity recognition and coreference resolution. Two varieties of each of the three languages are covered: the standard variety and the variety of the language of the Internet.

Nikola Ljubešić, Tanja Samardžić, Tomaž Erjavec, Darja Fišer, Maja Miličević Petrović, Simon Krek, Vuk Batanović

eReL, 2019.

More information: PDF

SETimes.SR – A Reference Training Corpus of Serbian

In this paper we present SETimes.SR – a gold standard dataset for Serbian, annotated with regard to document, sentence, and token segmentation, morphosyntax, lemmas, dependency syntax, and named entities. We describe the annotation layers and provide a basic statistical overview of them, and we discuss the method of encoding them in the CoNLL and the TEI format. In addition, we compare the SETimes.SR corpus with the older SETimes.HR dataset in Croatian.

Vuk Batanović, Nikola Ljubešić, Tanja Samardžić

JT-DH, 2018.

More information: PDF, Slides, Dataset, CLARIN repository, NoSketch Engine interface, KonText interface

hr500k – A Reference Training Corpus of Croatian

In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also give a description of the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.

Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec

JT-DH, 2018.

More information: PDF, Slides, Dataset, CLARIN repository, NoSketch Engine interface, KonText interface

Fine-grained Semantic Textual Similarity for Serbian

Although the task of semantic textual similarity (STS) has gained in prominence in the last few years, annotated STS datasets for model training and evaluation, particularly those with fine-grained similarity scores, remain scarce for languages other than English, and practically non-existent for minor ones. In this paper, we present the Serbian Semantic Textual Similarity News Corpus (STS.news.sr) – an STS dataset for Serbian that contains 1192 sentence pairs annotated with fine-grained semantic similarity scores. We describe the process of its creation and annotation, and we analyze and compare our corpus with the existing news-based STS datasets in English and other major languages. Several existing STS models are evaluated on the Serbian STS News Corpus, and a new supervised bag-of-words model that combines part-of-speech weighting with term frequency weighting is proposed and shown to outperform similar methods. Since Serbian is a morphologically rich language, the effect of various morphological normalization tools on STS model performances is considered as well. The Serbian STS News Corpus, the annotation tool and guidelines used in its creation, and the STS model framework used in the evaluation are all made publicly available.

Vuk Batanović, Miloš Cvetanović, Boško Nikolić

LREC, 2018.

More information: PDF, Source code, Dataset, STSAnno annotation tool, Annotation guidelines for the semantic similarity of short texts, Serbian web corpus srWaC, ReLDI tokenizer for Serbian, Stemmers for Serbian and Croatian, BTagger for Serbian, HunPos and CST models for Croatian, ReLDI tagger and lemmatizer for Serbian and Croatian

Sentiment Classification of Documents in Serbian: The Effects of Morphological Normalization and Word Embeddings

An open issue in the sentiment classification of texts written in Serbian is the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled texts. In this paper, we assess the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset. We also consider the effectiveness of using word embeddings, generated from a large unlabeled corpus, as classification features.

Vuk Batanović, Boško Nikolić

Telfor Journal, 2017.

More information: PDF, Dataset, Serbian web corpus srWaC, ReLDI tokenizer for Serbian, Stemmers for Serbian and Croatian, BTagger for Serbian, HunPos and CST models for Croatian, ReLDI tagger and lemmatizer for Serbian and Croatian, NBSVM algorithm implementation for Weka

Sentiment Classification of Documents in Serbian: The Effects of Morphological Normalization

Sentiment classification of texts written in Serbian is still an under-researched topic. One of the open issues is how the different forms of morphological normalization affect the performances of different sentiment classifiers and which normalization procedure is optimal for this task. In this paper we assess and compare the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset.

Vuk Batanović, Boško Nikolić
Blažo Mirčevski Award for the best paper by a young author

TELFOR, 2016.

More information: PDF, Dataset, ReLDI tokenizer for Serbian, Stemmers for Serbian and Croatian, BTagger for Serbian, HunPos and CST models for Croatian, ReLDI tagger and lemmatizer for Serbian and Croatian, NBSVM algorithm implementation for Weka

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset

Collecting data for sentiment analysis in resource-limited languages carries a significant risk of sample selection bias, since the small quantities of available data are most likely not representative of the whole population. Ignoring this bias leads to less robust machine learning classifiers and less reliable evaluation results. In this paper we present a dataset balancing algorithm that minimizes the sample selection bias by eliminating irrelevant systematic differences between the sentiment classes. We prove its superiority over the random sampling method and we use it to create the Serbian movie review dataset – SerbMR – the first balanced and topically uniform sentiment analysis dataset in Serbian. In addition, we propose an incremental way of finding the optimal combination of simple text processing options and machine learning features for sentiment classification. Several popular classifiers are used in conjunction with this evaluation approach in order to establish strong but reliable baselines for sentiment analysis in Serbian.

Vuk Batanović, Boško Nikolić, Milan Milosavljević

LREC, 2016.

More information: PDF, Dataset, Stemmers for Serbian and Croatian, NBSVM algorithm implementation for Weka

Using Part-of-Speech Tags as Deep-Syntax Indicators in Determining Short-Text Semantic Similarity

This paper presents POST STSS, a method of determining short-text semantic similarity in which part-of-speech tags are used as indicators of the deeper syntactic information usually extracted by more advanced tools like parsers and semantic role labelers. Our model employs a part-of-speech weighting scheme and is based on a statistical bag-of-words approach. It does not require either hand-crafted knowledge bases or advanced syntactic tools, which makes it easily applicable to languages with limited natural language processing resources. By using a paraphrase recognition test, we demonstrate that our system achieves a higher accuracy than all existing statistical similarity algorithms and solutions of a more structural kind.
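To give a feel for what part-of-speech weighting in a bag-of-words similarity measure can look like, here is a minimal sketch. It is not the POST STSS method: the weight values, the tag names, and the use of cosine similarity are all assumptions chosen for illustration, and the input is assumed to be already POS-tagged.

```python
import math
from collections import defaultdict

# Hypothetical POS weights chosen for illustration only; not taken from the paper.
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.8, "ADJ": 0.6, "ADV": 0.4}
DEFAULT_WEIGHT = 0.1  # function words and any tag not listed above

def weighted_vector(tagged_tokens):
    """Build a bag-of-words vector in which each token counts by its POS weight."""
    vec = defaultdict(float)
    for word, tag in tagged_tokens:
        vec[word.lower()] += POS_WEIGHTS.get(tag, DEFAULT_WEIGHT)
    return vec

def pos_weighted_similarity(tagged_a, tagged_b):
    """Cosine similarity between two POS-weighted bag-of-words vectors."""
    va, vb = weighted_vector(tagged_a), weighted_vector(tagged_b)
    dot = sum(weight * vb[word] for word, weight in va.items() if word in vb)
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm_a * norm_b)

s1 = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
s2 = [("A", "DET"), ("cat", "NOUN"), ("rests", "VERB")]
s3 = [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")]

# Sharing a noun counts for much more than sharing a determiner.
print(pos_weighted_similarity(s1, s2) > pos_weighted_similarity(s1, s3))
```

The only linguistic resource this needs is a POS tagger, which is the point the abstract makes about applicability to languages with limited NLP tooling.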

Vuk Batanović, Dragan Bojić

ComSIS, 2015.

More information: PDF, Dataset

Evaluation and Classification of Syntax Usage in Determining Short-Text Semantic Similarity

This paper outlines and categorizes ways of using syntactic information in a number of algorithms for determining the semantic similarity of short texts. We consider the use of word order information, part-of-speech tagging, parsing and semantic role labeling. We analyze and evaluate the effects of syntax usage on algorithm performance by utilizing the results of a paraphrase detection test on the Microsoft Research Paraphrase Corpus. We also propose a new classification of algorithms based on their applicability to languages with scarce natural language processing tools.
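A paraphrase detection test turns a similarity algorithm into a binary classifier: a sentence pair is predicted to be a paraphrase when its similarity score reaches a threshold, and the predictions are compared with the gold labels. The sketch below shows that evaluation step; the function name, the default threshold, and the toy scores are hypothetical and do not come from the paper or the corpus.

```python
def paraphrase_accuracy(scores, gold_labels, threshold=0.5):
    """Fraction of pairs whose thresholded similarity matches the gold label."""
    if len(scores) != len(gold_labels):
        raise ValueError("scores and gold_labels must have the same length")
    if not scores:
        raise ValueError("at least one sentence pair is required")
    # A pair is predicted to be a paraphrase when its score reaches the threshold.
    predictions = [score >= threshold for score in scores]
    correct = sum(pred == bool(gold) for pred, gold in zip(predictions, gold_labels))
    return correct / len(scores)

# Hypothetical similarity scores and gold labels (1 = paraphrase, 0 = not).
scores = [0.9, 0.2, 0.6, 0.4]
gold = [1, 0, 1, 1]

print(paraphrase_accuracy(scores, gold, threshold=0.5))
print(paraphrase_accuracy(scores, gold, threshold=0.3))
```

Because the result depends on the threshold, comparisons between algorithms are only meaningful when each threshold is chosen in a consistent way, for example by tuning it on training data.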

Vuk Batanović, Dragan Bojić

Telfor Journal, 2014.

More information: PDF, Dataset

Evaluation and classification of the use of syntactic information in determining the semantic similarity of short texts

This paper presents and categorizes ways of using syntactic information in a number of algorithms for determining the semantic similarity of short texts. Algorithm performance was evaluated using the results of a paraphrase detection test on the Microsoft Research Paraphrase Corpus. Among all the described algorithms and approaches to using syntactic information, those most suitable for languages with limited electronic language tools are identified and, with that purpose in mind, a new classification of algorithms is proposed.

Vuk Batanović, Dragan Bojić

TELFOR, 2013.

More information: PDF, Dataset

Semantic similarity of short texts in languages with a deficient natural language processing support

Measuring the semantic similarity of short texts is a noteworthy problem since short texts are widely used on the Internet, in the form of product descriptions or captions, image and webpage tags, news headlines, etc. This paper describes a methodology which can be used to create a software system capable of determining the semantic similarity of two given short texts. The proposed LInSTSS approach is particularly suitable for application in situations when no large, publicly available, electronic linguistic resources can be found for the desired language. We describe the basic working principles of the system architecture we propose, as well as the stages of its construction and use. Also, we explain the procedure used to generate a paraphrase corpus which is then utilized in the evaluation process. Finally, we analyze the evaluation results obtained from a system created for the Serbian language, and we discuss possible improvements which would increase system accuracy.

Bojan Furlan, Vuk Batanović, Boško Nikolić

Decision Support Systems, 2013.

More information: PDF, Source code, Dataset

A software system for determining the semantic similarity of short texts in Serbian

This paper describes a software system that assesses the degree of semantic similarity between two given short texts in Serbian. It explains the basic principles on which the system operates, as well as the phases of its development and evaluation. It also describes the procedure for generating the paraphrase corpus on which the evaluation was performed. Finally, the evaluation results are analyzed and possibilities for improving the accuracy of the system are discussed.

Vuk Batanović, Bojan Furlan, Boško Nikolić

TELFOR, 2011.

More information: PDF, Dataset

A software system for learning expert systems

This paper describes a software system, implemented at the School of Electrical Engineering in Belgrade, for learning the Expert Systems course. The software was developed as an educational system for students and is used in teaching at the undergraduate and master's levels. The system allows examples and exercises to be reviewed step by step, organized by the topics they belong to and that are covered in lectures and exercise classes. The following simulations are implemented: breadth-first, depth-first, hill-climbing, best-first, branch-and-bound, and A* search strategies, game-tree search, production systems, the General Problem Solver (GPS), the STRIPS planning system, fuzzy logic, reasoning based on certainty factors, and other ways of expressing uncertainty. Students can also enter their own examples and exercises and thus simulate a desired situation. At any point, the simulation can be stepped back to the previous step or advanced to the next one. At the end of a simulation, the student can also print out the detailed solution procedure. The implemented system allows lecturers to work much more efficiently and students to master the material more quickly.
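As an example of the kind of search strategy such a system simulates, the following is a minimal sketch of A* search on a small weighted graph. It is a textbook formulation written for this page, not code from the described system; the graph, the heuristic values, and the function name are hypothetical.

```python
import heapq

def a_star(graph, heuristic, start, goal):
    """Return (path, cost) of a cheapest path from start to goal, or (None, inf).

    graph maps each node to a list of (neighbor, step_cost) pairs;
    heuristic maps each node to an estimate of its remaining cost to the goal.
    The result is optimal only if the heuristic never overestimates.
    """
    # Frontier entries: (estimated total cost, cost so far, node, path so far).
    frontier = [(heuristic[start], 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for neighbor, step_cost in graph.get(node, []):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                estimate = new_cost + heuristic[neighbor]
                heapq.heappush(frontier, (estimate, new_cost, neighbor, path + [neighbor]))
    return None, float("inf")

# Hypothetical example graph with an admissible heuristic towards "D".
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
heuristic = {"A": 3, "B": 2, "C": 1, "D": 0}

print(a_star(graph, heuristic, "A", "D"))
```

Setting every heuristic value to zero reduces this to uniform-cost search, which is one way to see how the informed and uninformed strategies listed above relate to each other.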

Dražen Drašković, Vuk Batanović, Boško Nikolić

TELFOR, 2010.

More information: PDF, Source code