MilosevicStemmer

java.lang.Object
- weka.core.stemmers.SCStemmer
  - weka.core.stemmers.SerbianStemmer
    - weka.core.stemmers.MilosevicStemmer

All Implemented Interfaces:

java.io.Serializable, weka.core.RevisionHandler, weka.core.stemmers.Stemmer
```
public class MilosevicStemmer
extends SerbianStemmer
```
This class implements the stemmer for Serbian created in Nikola Milošević's Master's thesis and described in the arXiv paper:
Stemmer for Serbian language, Nikola Milošević, arXiv preprint arXiv:1209.4471 (2012).
http://arxiv.org/abs/1209.4471

This stemmer uses a slightly modified version of the so-called dual1 coding system, in which all Cyrillic letters are converted to their Latin equivalents and every Latin letter that carries a diacritical mark (š, đ, č, ć, ž, dž) is written as a pair of Latin letters without diacritical marks. Unlike the stemmers of Kešelj and Šipka, this one does NOT convert the letters lj/Lj and nj/Nj into the ly/Ly and ny/Ny forms.

Note: the implementation of the algorithm differs in certain aspects from the one in the arXiv paper; see the code for details.
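
The coding step described above can be illustrated with a minimal, self-contained sketch. It is not the library's code: the class name `Dual1Sketch`, the method `toDual1`, and in particular the concrete two-letter codes (`sx`, `dx`, `cy`, `cx`, `zx`, `dy`) are assumptions made for illustration, and only lowercase input is handled. The sketch shows the two properties stated in the description: Cyrillic is first mapped to Latin, and lj/nj are left untouched.

```java
// Illustrative sketch only; names and digraph codes are assumptions,
// not taken from the MilosevicStemmer source.
final class Dual1Sketch {
    // Lowercase Serbian Cyrillic alphabet and its Latin equivalents, index-aligned.
    private static final String CYR = "абвгдђежзијклљмнњопрстћуфхцчџш";
    private static final String[] LAT = {
        "a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i",
        "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r",
        "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"
    };

    static String toDual1(String text) {
        // Step 1: transliterate Cyrillic letters to Latin; other characters pass through.
        StringBuilder latin = new StringBuilder();
        for (char c : text.toCharArray()) {
            int idx = CYR.indexOf(c);
            latin.append(idx >= 0 ? LAT[idx] : String.valueOf(c));
        }
        // Step 2: replace letters with diacritics by two plain Latin letters.
        // "dž" must be handled before "ž". The codes below are ASSUMED.
        return latin.toString()
                .replace("dž", "dy")
                .replace("š", "sx")
                .replace("đ", "dx")
                .replace("č", "cy")
                .replace("ć", "cx")
                .replace("ž", "zx");
        // Note: "lj" and "nj" are deliberately NOT rewritten to "ly"/"ny".
    }

    public static void main(String[] args) {
        System.out.println(toDual1("шума"));   // Cyrillic input
        System.out.println(toDual1("ljubav")); // lj stays as is
    }
}
```

Processing the digraph dž before the single letter ž keeps the two from interfering; a character-by-character converter would instead need to look at the preceding character to make the same distinction.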

Author:

Vuk Batanović

See Also:

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset, Vuk Batanović, Boško Nikolić, Milan Milosavljević, in Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2688-2696, Portorož, Slovenia (2016).
https://github.com/vukbatanovic/SCStemmers
Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`private java.util.HashMap<java.lang.String,java.lang.String>`	`dictionary` A dictionary used to normalize frequently used irregular verbs
`private static long`	`serialVersionUID`

Fields inherited from class weka.core.stemmers.SerbianStemmer
maxSuffixLen, rules

Constructor Summary

Constructors
Constructor and Description
`MilosevicStemmer()`

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`protected java.lang.String`	`convertToDual1Character(int intCharacter, char oldChar)` Converts a given character from the standard form (in the Cyrillic or Latin script) to the dual1 coding system.
`protected void`	`initRules()` Milošević: Currently 285 rules
`java.lang.String`	`stemDual1Line(java.lang.String line)` Stems a line of text written in the dual1 coding system
`java.lang.String`	`stemDual1Word(java.lang.String word)` In this implementation, Milošević's original algorithm is somewhat altered so that it more closely resembles the stemmers of Kešelj and Šipka.

Methods inherited from class weka.core.stemmers.SerbianStemmer
convertToDual1File, convertToDual1String, convertToNormalFile, convertToNormalString, initMaxSuffixLen, stemDual1File, stemLine, stemWord

Methods inherited from class weka.core.stemmers.SCStemmer
getRevision, main, replaceSpaceWithNewLine, stem, stemFile, stemText

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - serialVersionUID
```
private static final long serialVersionUID
```
    See Also:
    
    Constant Field Values
  - dictionary
```
private java.util.HashMap<java.lang.String,java.lang.String> dictionary
```
    A dictionary used to normalize frequently used irregular verbs
- Constructor Detail
  - MilosevicStemmer
```
public MilosevicStemmer()
```
- Method Detail
  - stemDual1Word
```
public java.lang.String stemDual1Word(java.lang.String word)
```
    In this implementation, Milošević's original algorithm is somewhat altered so that it more closely resembles the stemmers of Kešelj and Šipka. Two issues with the original motivated the change:
    1. The original algorithm is slower than the stemmers of Kešelj and Šipka, since it iterates through all the rules for each word instead of finding the rule that covers the longest part of the suffix.
    2. Since the original algorithm depends on the order in which the rules were entered into the system, some longer/more complex rules become useless: they can never be reached because shorter rules precede them. For instance, in words ending in 'ene' or 'une' only 'ne' is removed, since that rule comes before the two longer ones in the rule ordering (this behavior would be correct only for words that removing the longer suffix would reduce to 1 letter). However, since such cases are relatively few, the output of this modified version of Milošević's algorithm rarely differs from the output of the original.
    The original algorithm is commented out at the bottom of the function.
    Specified by:
    
    stemDual1Word in class SerbianStemmer
    
    Parameters:
    
    word - The word to be stemmed
    
    Returns:
    
    The stemmed word
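
    The difference between order-based and longest-suffix matching described for stemDual1Word can be sketched with a toy rule set. This is not the code of this class: the class `SuffixMatchSketch`, its three rules, the constant `MIN_STEM_LEN`, and the choice to stop at the first matching rule in the order-based variant are all simplifying assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; rules and thresholds are hypothetical.
final class SuffixMatchSketch {
    // Toy rules (suffix -> replacement) in insertion order:
    // the short rule "ne" was entered before the longer "ene" and "une".
    static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ne", "");
        RULES.put("ene", "");
        RULES.put("une", "");
    }
    static final int MAX_SUFFIX_LEN = 3; // length of the longest rule suffix
    static final int MIN_STEM_LEN = 2;   // ASSUMED lower bound on the remaining stem

    // Order-based matching: rules are scanned in insertion order and the
    // first applicable one wins, so "ene"/"une" are shadowed by "ne".
    static String stemFirstMatch(String word) {
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            String suffix = rule.getKey();
            if (word.endsWith(suffix) && word.length() - suffix.length() >= MIN_STEM_LEN) {
                return word.substring(0, word.length() - suffix.length()) + rule.getValue();
            }
        }
        return word;
    }

    // Longest-suffix matching: candidate suffixes are tried from longest to
    // shortest with one map lookup each, independent of insertion order.
    static String stemLongestMatch(String word) {
        int longest = Math.min(MAX_SUFFIX_LEN, word.length() - MIN_STEM_LEN);
        for (int len = longest; len >= 1; len--) {
            String suffix = word.substring(word.length() - len);
            String replacement = RULES.get(suffix);
            if (replacement != null) {
                return word.substring(0, word.length() - len) + replacement;
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stemFirstMatch("zelene"));   // prints "zele"
        System.out.println(stemLongestMatch("zelene")); // prints "zel"
    }
}
```

    The longest-suffix variant performs at most MAX_SUFFIX_LEN lookups per word regardless of how many rules exist, which is the efficiency argument made in point 1.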
  - stemDual1Line
```
public java.lang.String stemDual1Line(java.lang.String line)
```
    Description copied from class: SerbianStemmer
    
    Stems a line of text written in the dual1 coding system
    
    Specified by:
    
    stemDual1Line in class SerbianStemmer
    
    Parameters:
    
    line - The line of text to be processed
    
    Returns:
    
    The line of text with stemmed words
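
    As a rough illustration of how line-level stemming can be built on top of a word-level stemmer, the sketch below splits a line on whitespace, stems each token, and joins the results with single spaces. The class `LineStemSketch`, the whitespace tokenization, and the toy word stemmer in `main` are assumptions; the actual stemDual1Line may tokenize and handle punctuation differently.

```java
import java.util.function.UnaryOperator;

// Illustrative sketch only; not the SerbianStemmer implementation.
final class LineStemSketch {
    // Applies wordStemmer to every whitespace-separated token of the line.
    static String stemLine(String line, UnaryOperator<String> wordStemmer) {
        StringBuilder out = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty or all-whitespace line
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(wordStemmer.apply(token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical word stemmer: strips one trailing "a".
        UnaryOperator<String> toy =
                w -> w.endsWith("a") ? w.substring(0, w.length() - 1) : w;
        System.out.println(stemLine("  lepa   kuca je ", toy)); // prints "lep kuc je"
    }
}
```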
  - convertToDual1Character
```
protected java.lang.String convertToDual1Character(int intCharacter,
                                                   char oldChar)
```
    Converts a given character from the standard form (in the Cyrillic or Latin script) to the dual1 coding system. Milošević's stemmer does not implement the full dual1 coding system: it omits the conversion of 'lj' and 'nj' into 'ly' and 'ny', so this function is slightly different from the one used for the stemmers of Kešelj and Šipka.
    
    Specified by:
    
    convertToDual1Character in class SerbianStemmer
    
    Parameters:
    
    intCharacter - Unicode code point of the character that should be translated into the dual1 system
    
    oldChar - The character that preceded the current one in the text
    
    Returns:
    
    A string that contains the dual1 representation of the given character
  - initRules
```
protected void initRules()
```
    Milošević: Currently 285 rules
    
    Overrides:
    
    initRules in class SerbianStemmer
