Sentiment Classification of Documents in Serbian: The Effects of Morphological Normalization and Word Embeddings

Vuk Batanović, Boško Nikolić

Abstract

Two open issues in the sentiment classification of texts written in Serbian are the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled text. In this paper, we assess the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset. We also consider the effectiveness of using word embeddings, generated from a large unlabeled corpus, as classification features.

Type

Journal

Publication

Telfor Journal, Vol. 9, No. 2, pp. 104-109

DOI

10.5937/telfor1702104B

Date

December 2017

Links

PDF
Dataset
Serbian web corpus srWaC
ReLDI tokenizer for Serbian
Stemmers for Serbian and Croatian
BTagger for Serbian
HunPos and CST models for Croatian
ReLDI tagger and lemmatizer for Serbian and Croatian
NBSVM implementation for Weka