Statistical approaches to natural language processing typically require considerable amounts of labeled data, and often various auxiliary language tools as well, which limits their applicability in resource-limited settings. This thesis presents a methodology for developing statistical solutions for the semantic processing of natural languages with limited resources. For these languages, not only are existing language resources limited, but so are the capabilities for developing new datasets and dedicated tools and algorithms. The proposed methodology focuses on short texts because of their prevalence in digital communication and the greater complexity of their semantic processing.
The methodology encompasses all phases in the creation of statistical solutions, from the collection of textual content, to data annotation, to the formulation, training, and evaluation of machine learning models. Its use is illustrated in detail on two semantic tasks: sentiment analysis and semantic textual similarity. Serbian is used as an example of a language with limited resources, but the proposed methodology can also be applied to other languages in this category.
In addition to the general methodology, the contributions of this thesis include a new, flexible short-text sentiment annotation system, a new annotation cost-effectiveness metric, and several new semantic textual similarity models. The thesis results also include the creation of the first publicly available annotated datasets of short texts in Serbian for the tasks of sentiment analysis and semantic textual similarity, the development and evaluation of numerous models on these tasks, and the first comparative evaluation of multiple morphological normalization tools on short texts in Serbian.