KeseljSipkaStemmerOptimal

java.lang.Object
- weka.core.stemmers.SCStemmer
  - weka.core.stemmers.SerbianStemmer
    - weka.core.stemmers.KeseljSipkaStemmer
      - weka.core.stemmers.KeseljSipkaStemmerOptimal

All Implemented Interfaces:

java.io.Serializable, weka.core.RevisionHandler, weka.core.stemmers.Stemmer
```
public class KeseljSipkaStemmerOptimal
extends KeseljSipkaStemmer
```
This class implements the optimal stemmer for Serbian described in the paper:
A Suffix Subsumption-Based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources, Vlado Kešelj, Danko Šipka, Infotheca 9(1-2), 23a-33a (2008).
http://infoteka.bg.ac.rs/pdf/Eng/2008/INFOTHECA_IX_1-2_May2008_23a-33a.pdf
Serbian edition: Pristup izgradnji stemera i lematizora za jezike s bogatom fleksijom i oskudnim resursima zasnovan na obuhvatanju sufiksa, Vlado Kešelj, Danko Šipka, Infoteka 9(1-2), 21-31 (2008).
http://infoteka.bg.ac.rs/pdf/Srp/2008/04%20Vlado-Danko_Stemeri.pdf
(The original implementation in Perl and other resources are available at: http://www.cs.dal.ca/~vlado/nlp/2007-sr/)

This stemmer uses the so-called dual1 coding system, in which all Cyrillic letters are transformed into their Latin equivalents and every Latin letter with a diacritical mark (š, đ, č, ć, ž, dž) is written as a pair of Latin letters without diacritical marks. In addition, the letters lj/Lj and nj/Nj are transformed into ly/Ly and ny/Ny.
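
The dual1 rewriting described above can be illustrated with a minimal sketch for Latin-script input. Only the lj/nj mappings are stated on this page; the two-letter codes used for the diacritic letters (for example `sx` for š) are assumptions made for illustration and should be checked against `convertToDual1Character`. The class name `Dual1Sketch` and the method `toDual1` are hypothetical and are not part of the library; Cyrillic input and all-caps digraphs such as LJ are not handled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of weka.core.stemmers.
class Dual1Sketch {
    // Insertion order matters: the digraph "d + z-caron" must be rewritten
    // before the single letter z-caron, so a LinkedHashMap is used.
    private static final Map<String, String> DUAL1 = new LinkedHashMap<>();
    static {
        // Stated in the class description: lj/Lj -> ly/Ly, nj/Nj -> ny/Ny.
        DUAL1.put("lj", "ly");
        DUAL1.put("Lj", "Ly");
        DUAL1.put("nj", "ny");
        DUAL1.put("Nj", "Ny");
        // ASSUMPTION: the codes below are illustrative; this page only says
        // that each such letter becomes two letters without diacritics.
        DUAL1.put("d\u017E", "dy"); // d + z-caron digraph
        DUAL1.put("D\u017E", "Dy");
        DUAL1.put("\u0161", "sx");  // s-caron
        DUAL1.put("\u0160", "Sx");
        DUAL1.put("\u0111", "dx");  // d-stroke
        DUAL1.put("\u0110", "Dx");
        DUAL1.put("\u010D", "cx");  // c-caron
        DUAL1.put("\u010C", "Cx");
        DUAL1.put("\u0107", "cy");  // c-acute
        DUAL1.put("\u0106", "Cy");
        DUAL1.put("\u017E", "zx");  // z-caron
        DUAL1.put("\u017D", "Zx");
    }

    // Applies every mapping in insertion order to a Latin-script string.
    static String toDual1(String latin) {
        String out = latin;
        for (Map.Entry<String, String> e : DUAL1.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(toDual1("ljubav"));
        System.out.println(toDual1("konj"));
    }
}
```

Plain sequential replacement keeps the sketch short, but it treats every "lj" and "nj" sequence as a single letter, which a production converter may need to refine.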

Author:

Vuk Batanović

See Also:

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset, Vuk Batanović, Boško Nikolić, Milan Milosavljević, in Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2688-2696, Portorož, Slovenia (2016).
https://github.com/vukbatanovic/SCStemmers
Serialized Form

Field Summary

Fields
Modifier and Type Field and Description

private static long serialVersionUID
- Fields inherited from class weka.core.stemmers.SerbianStemmer
  maxSuffixLen, rules

Constructor Summary

Constructors
Constructor and Description

KeseljSipkaStemmerOptimal()

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`private void`	`initOpt1()`
`private void`	`initOpt2()`
`private void`	`initOpt3()`
`private void`	`initOpt4()`
`private void`	`initOpt5()`
`private void`	`initOpt6()`
`protected void`	`initRules()` 17839 optimal rules of the form: 'suffix', 'removed suffix'. Training accuracy (in the original paper) = 81.8309759422188. The initialization method had to be divided into several submethods due to the length of the rule list.

Methods inherited from class weka.core.stemmers.KeseljSipkaStemmer
convertToDual1Character, stemDual1Line, stemDual1Word

Methods inherited from class weka.core.stemmers.SerbianStemmer
convertToDual1File, convertToDual1String, convertToNormalFile, convertToNormalString, initMaxSuffixLen, stemDual1File, stemLine, stemWord

Methods inherited from class weka.core.stemmers.SCStemmer
getRevision, main, replaceSpaceWithNewLine, stem, stemFile, stemText

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - serialVersionUID
```
private static final long serialVersionUID
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - KeseljSipkaStemmerOptimal
```
public KeseljSipkaStemmerOptimal()
```
- Method Detail
  - initRules
```
protected void initRules()
```
    17839 optimal rules of the form: 'suffix', 'removed suffix'.
    Training accuracy (in the original paper) = 81.8309759422188.
    The initialization method had to be divided into several submethods due to the length of the rule list.
    
    Overrides:
    
    initRules in class SerbianStemmer
  - initOpt1
```
private void initOpt1()
```
  - initOpt2
```
private void initOpt2()
```
  - initOpt3
```
private void initOpt3()
```
  - initOpt4
```
private void initOpt4()
```
  - initOpt5
```
private void initOpt5()
```
  - initOpt6
```
private void initOpt6()
```
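
The split of `initRules()` into `initOpt1()` through `initOpt6()` can be pictured with the sketch below. A likely reason for the split is that the JVM limits a single method body to 64 KB of bytecode, and `javac` reports "code too large" beyond that. Everything in the sketch is an assumption made for illustration: the class `SplitRuleInitSketch`, the methods `initPart1`/`initPart2`, and the handful of rules are hypothetical, and the rule value is read here as the trailing part of the matched suffix that gets stripped, which is only one possible reading of 'suffix', 'removed suffix'. The longest-match lookup is likewise a guess at how `rules` and `maxSuffixLen` might be used, not the library's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the library implementation.
class SplitRuleInitSketch {
    private final Map<String, String> rules = new HashMap<>();
    private int maxSuffixLen;

    SplitRuleInitSketch() {
        initRules();
        // Remember the longest suffix so lookups can start from that length.
        for (String suffix : rules.keySet()) {
            maxSuffixLen = Math.max(maxSuffixLen, suffix.length());
        }
    }

    // Delegates to smaller methods so that no single method body
    // approaches the per-method bytecode limit.
    private void initRules() {
        initPart1();
        initPart2();
    }

    // HYPOTHETICAL rules: key = suffix to match, value = part to strip.
    private void initPart1() {
        rules.put("ovima", "ovima");
        rules.put("ama", "ama");
    }

    private void initPart2() {
        rules.put("skog", "og");
        rules.put("a", "a");
    }

    // Tries the longest candidate suffix first and applies the first rule found.
    String stemWord(String word) {
        for (int len = Math.min(maxSuffixLen, word.length()); len > 0; len--) {
            String suffix = word.substring(word.length() - len);
            String removed = rules.get(suffix);
            if (removed != null && word.endsWith(removed)) {
                return word.substring(0, word.length() - removed.length());
            }
        }
        return word; // no rule matched
    }

    public static void main(String[] args) {
        SplitRuleInitSketch stemmer = new SplitRuleInitSketch();
        System.out.println(stemmer.stemWord("gradovima"));
    }
}
```

Because every part writes into the same map, the number of parts can grow with the rule list without changing the lookup code.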
