Fine-grained Semantic Textual Similarity for Serbian

Vuk Batanović, Miloš Cvetanović, Boško Nikolić

Abstract

Although the task of semantic textual similarity (STS) has gained prominence in the last few years, annotated STS datasets for model training and evaluation, particularly those with fine-grained similarity scores, remain scarce for languages other than English, and practically non-existent for minor ones. In this paper, we present the Serbian Semantic Textual Similarity News Corpus (STS.news.sr) – an STS dataset for Serbian that contains 1192 sentence pairs annotated with fine-grained semantic similarity scores. We describe the process of its creation and annotation, and we analyze and compare our corpus with the existing news-based STS datasets in English and other major languages. Several existing STS models are evaluated on the Serbian STS News Corpus, and a new supervised bag-of-words model that combines part-of-speech weighting with term frequency weighting is proposed and shown to outperform similar methods. Since Serbian is a morphologically rich language, the effect of various morphological normalization tools on STS model performance is considered as well. The Serbian STS News Corpus, the annotation tool and guidelines used in its creation, and the STS model framework used in the evaluation are all made publicly available.

Type

Conference proceedings

Publication

Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, pp. 1370-1378, ELRA

Date

May 2018

Links

PDF
Code
Dataset
STSAnno annotation tool
STS annotation guidelines
Serbian web corpus srWaC
ReLDI tokenizer for Serbian
Stemmers for Serbian and Croatian
BTagger for Serbian
HunPos and CST models for Croatian
ReLDI tagger and lemmatizer for Serbian and Croatian