public class KeseljSipkaStemmerOptimal extends KeseljSipkaStemmer
This class implements the optimal stemmer for Serbian described in the paper:
A Suffix Subsumption-Based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources, Vlado Kešelj, Danko Šipka, Infotheca 9(1-2), 23a-33a (2008).
http://infoteka.bg.ac.rs/pdf/Eng/2008/INFOTHECA_IX_1-2_May2008_23a-33a.pdf
Serbian edition of the paper: Pristup izgradnji stemera i lematizora za jezike s bogatom fleksijom i oskudnim resursima zasnovan na obuhvatanju sufiksa, Vlado Kešelj, Danko Šipka, Infoteka 9(1-2), 21-31 (2008).
http://infoteka.bg.ac.rs/pdf/Srp/2008/04%20Vlado-Danko_Stemeri.pdf
(The original implementation in Perl and other resources are available at: http://www.cs.dal.ca/~vlado/nlp/2007-sr/)
This stemmer uses the so-called dual1 coding system, in which all Cyrillic letters are converted to their Latin equivalents and every Latin letter with a diacritical mark (š, đ, č, ć, ž), as well as the digraph dž, is written as a pair of Latin letters without diacritical marks. In addition, the letters lj/Lj and nj/Nj are written as ly/Ly and ny/Ny.
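The dual1 transformation described above can be illustrated with a minimal sketch. The class `Dual1Sketch` and the method `toDual1` are hypothetical names and not part of this API. Only the lj/nj to ly/ny mapping comes from the description; the specific two-letter codes used for š, đ, č, ć, ž and dž are an assumption made for illustration, and the Cyrillic-to-Latin step is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of dual1-style coding; not the library's own code. */
final class Dual1Sketch {
    private static final Map<String, String> CODES = new HashMap<>();
    static {
        // Confirmed by the class description: lj/Lj -> ly/Ly, nj/Nj -> ny/Ny.
        CODES.put("lj", "ly"); CODES.put("Lj", "Ly");
        CODES.put("nj", "ny"); CODES.put("Nj", "Ny");
        // ASSUMPTION: the two-letter codes below are illustrative only.
        CODES.put("d\u017E", "dy"); CODES.put("D\u017E", "Dy"); // dz with caron
        CODES.put("\u0161", "sx"); CODES.put("\u0160", "Sx");   // s with caron
        CODES.put("\u0111", "dx"); CODES.put("\u0110", "Dx");   // d with stroke
        CODES.put("\u010D", "cx"); CODES.put("\u010C", "Cx");   // c with caron
        CODES.put("\u0107", "cy"); CODES.put("\u0106", "Cy");   // c with acute
        CODES.put("\u017E", "zx"); CODES.put("\u017D", "Zx");   // z with caron
    }

    /** Converts Latin-script text to the assumed dual1 form. */
    static String toDual1(String latin) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < latin.length()) {
            // Try a two-character digraph first so it wins over single letters.
            if (i + 2 <= latin.length()) {
                String code = CODES.get(latin.substring(i, i + 2));
                if (code != null) {
                    out.append(code);
                    i += 2;
                    continue;
                }
            }
            String one = latin.substring(i, i + 1);
            String code = CODES.get(one);
            out.append(code != null ? code : one);
            i += 1;
        }
        return out.toString();
    }
}
```

Under these assumptions, `Dual1Sketch.toDual1("ljubav")` yields `lyubav` and `Dual1Sketch.toDual1("konj")` yields `kony`. Checking digraphs before single letters keeps dž from being split into d followed by ž.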
Modifier and Type | Field and Description |
---|---|
private static long | serialVersionUID |
Inherited fields: maxSuffixLen, rules
Constructor and Description |
---|
KeseljSipkaStemmerOptimal() |
Modifier and Type | Method and Description |
---|---|
private void | initOpt1() |
private void | initOpt2() |
private void | initOpt3() |
private void | initOpt4() |
private void | initOpt5() |
private void | initOpt6() |
protected void | initRules() 17839 optimal rules of the form: 'suffix', 'removed suffix'. Training accuracy (in the original paper) = 81.8309759422188. The initialization method had to be split into several submethods because of the length of the rule list. |
Inherited methods:
convertToDual1Character, stemDual1Line, stemDual1Word
convertToDual1File, convertToDual1String, convertToNormalFile, convertToNormalString, initMaxSuffixLen, stemDual1File, stemLine, stemWord
getRevision, main, replaceSpaceWithNewLine, stem, stemFile, stemText
private static final long serialVersionUID
protected void initRules()
17839 optimal rules of the form: 'suffix', 'removed suffix'.
Training accuracy (in the original paper) = 81.8309759422188.
The initialization method had to be split into several submethods because of the length of the rule list.
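As a rough sketch of how rules of the form 'suffix', 'removed suffix' might be stored and applied, the example below loads a small rule table in several parts and strips the recorded portion of the longest matching suffix. Everything in it is an assumption for illustration: the class `SuffixRuleSketch`, the methods `initPart1`, `initPart2` and `stem`, the three rules, and the longest-match policy are hypothetical and not taken from this class. A plausible reason for splitting a very long rule list is the JVM limit of 64 KB of bytecode per method.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of suffix rules; not the library's own code. */
final class SuffixRuleSketch {
    // Maps a word-final suffix to the part of it that is removed (assumed reading).
    private final Map<String, String> rules = new HashMap<>();
    private int maxSuffixLen = 0;

    SuffixRuleSketch() {
        // The rule list is loaded in parts, mirroring the idea of split init methods.
        initPart1();
        initPart2();
        for (String suffix : rules.keySet()) {
            maxSuffixLen = Math.max(maxSuffixLen, suffix.length());
        }
    }

    private void initPart1() {
        rules.put("ovima", "ovima"); // made-up rule: strip the whole suffix
        rules.put("ama", "ama");     // made-up rule: strip the whole suffix
    }

    private void initPart2() {
        rules.put("skom", "om");     // made-up rule: strip only the last two letters
    }

    /** Applies the longest matching rule to a dual1-coded word. */
    String stem(String word) {
        int longest = Math.min(maxSuffixLen, word.length());
        for (int len = longest; len > 0; len--) {
            String suffix = word.substring(word.length() - len);
            String removed = rules.get(suffix);
            if (removed != null) {
                // The removed part is never longer than the matched suffix here.
                return word.substring(0, word.length() - removed.length());
            }
        }
        return word; // no rule matched
    }
}
```

With these made-up rules, `new SuffixRuleSketch().stem("gradovima")` returns `grad`, while a word that matches no rule is returned unchanged. Bounding the search by the longest known suffix avoids testing suffixes that no rule can match.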
initRules in class SerbianStemmer
private void initOpt1()
private void initOpt2()
private void initOpt3()
private void initOpt4()
private void initOpt5()
private void initOpt6()