public class KeseljSipkaStemmerGreedy extends KeseljSipkaStemmer
This class implements the greedy stemmer for Serbian described in the paper:
A Suffix Subsumption-Based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources, Vlado Kešelj, Danko Šipka, Infotheca 9(1-2), 23a-33a (2008).
http://infoteka.bg.ac.rs/pdf/Eng/2008/INFOTHECA_IX_1-2_May2008_23a-33a.pdf
Serbian version of the paper: Pristup izgradnji stemera i lematizora za jezike s bogatom fleksijom i oskudnim resursima zasnovan na obuhvatanju sufiksa, Vlado Kešelj, Danko Šipka, Infoteka 9(1-2), 21-31 (2008).
http://infoteka.bg.ac.rs/pdf/Srp/2008/04%20Vlado-Danko_Stemeri.pdf
(The original implementation in Perl and other resources are available at: http://www.cs.dal.ca/~vlado/nlp/2007-sr/)
This stemmer uses the so-called dual1 coding system, in which all Cyrillic letters are converted to their Latin equivalents and every Latin letter with a diacritical mark (š, đ, č, ć, ž, dž) is written as a pair of Latin letters without diacritical marks. In addition, the letters lj/Lj and nj/Nj are written as ly/Ly and ny/Ny.
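The Latin-to-dual1 step described above can be illustrated with a minimal sketch. Only the ly/Ly and ny/Ny codes are stated in this documentation; the two-letter codes used here for š, đ, č, ć, ž and dž (sx, dx, cx, cy, zx, dy) are assumptions for illustration, as are the class name `Dual1Sketch` and the method `toDual1`. The sketch does not cover the Cyrillic-to-Latin step and is not the class's own `convertToDual1String` implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class Dual1Sketch {
    // Hypothetical code table. Only lj->ly and nj->ny come from the documentation;
    // the remaining codes are assumed for illustration.
    private static final Map<String, String> CODES = new HashMap<>();
    static {
        CODES.put("lj", "ly");       CODES.put("Lj", "Ly");
        CODES.put("nj", "ny");       CODES.put("Nj", "Ny");
        CODES.put("d\u017E", "dy");  CODES.put("D\u017E", "Dy");  // dž, Dž (assumed)
        CODES.put("\u0161", "sx");   CODES.put("\u0160", "Sx");   // š, Š (assumed)
        CODES.put("\u0111", "dx");   CODES.put("\u0110", "Dx");   // đ, Đ (assumed)
        CODES.put("\u010D", "cx");   CODES.put("\u010C", "Cx");   // č, Č (assumed)
        CODES.put("\u0107", "cy");   CODES.put("\u0106", "Cy");   // ć, Ć (assumed)
        CODES.put("\u017E", "zx");   CODES.put("\u017D", "Zx");   // ž, Ž (assumed)
    }

    // Converts Latin-script text to the two-letter coding, trying digraphs
    // (lj, nj, dž) before single letters so that "dž" is not split into d + ž.
    public static String toDual1(String latin) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < latin.length()) {
            String code = null;
            int used = 0;
            for (int n = 2; n >= 1 && code == null; n--) {
                if (i + n <= latin.length()) {
                    code = CODES.get(latin.substring(i, i + n));
                    used = n;
                }
            }
            if (code != null) {
                out.append(code);
                i += used;
            } else {
                out.append(latin.charAt(i));  // plain letter, copied unchanged
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDual1("ljubav"));
        System.out.println(toDual1("Njego\u0161"));
    }
}
```

The result contains only unaccented Latin letters, so suffix rules can be written and matched without any diacritics.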
| Modifier and Type | Field and Description |
|---|---|
| private static long | serialVersionUID |

Inherited fields: maxSuffixLen, rules

| Constructor and Description |
|---|
| KeseljSipkaStemmerGreedy() |
| Modifier and Type | Method and Description |
|---|---|
| protected void | initRules() 1000 greedy rules of the form "suffix", "removed suffix". Best accuracy (in the original paper) = 72.4589448093139 |

Inherited methods:
convertToDual1Character, stemDual1Line, stemDual1Word
convertToDual1File, convertToDual1String, convertToNormalFile, convertToNormalString, initMaxSuffixLen, stemDual1File, stemLine, stemWord
getRevision, main, replaceSpaceWithNewLine, stem, stemFile, stemText

private static final long serialVersionUID
protected void initRules()
1000 greedy rules of the form "suffix", "removed suffix".
Best accuracy (in the original paper) = 72.4589448093139
initRules in class SerbianStemmer
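As a rough illustration of how a table of "suffix", "removed suffix" pairs might be applied, the sketch below looks up the longest matching suffix first. It rests on several assumptions: that the second element of each pair is the trailing part stripped from a word ending in the first element, that lookup proceeds from the longest candidate suffix down, and that a field like `maxSuffixLen` bounds the search. The class name `SuffixRuleSketch`, the `stem` method, and all three rules are hypothetical and are not taken from the actual rule set.

```java
import java.util.HashMap;
import java.util.Map;

public class SuffixRuleSketch {
    // Hypothetical rules in dual1 coding: matched suffix -> trailing part to remove.
    private static final Map<String, String> RULES = new HashMap<>();
    private static int maxSuffixLen = 0;
    static {
        RULES.put("ovima", "ima");  // hypothetical: strip only part of the matched suffix
        RULES.put("ama", "ama");    // hypothetical
        RULES.put("a", "a");        // hypothetical
        for (String suffix : RULES.keySet()) {
            maxSuffixLen = Math.max(maxSuffixLen, suffix.length());
        }
    }

    // Tries candidate suffixes from longest to shortest and applies the first rule found.
    // At least one character of the word is always kept.
    public static String stem(String word) {
        for (int n = Math.min(maxSuffixLen, word.length() - 1); n >= 1; n--) {
            String suffix = word.substring(word.length() - n);
            String removed = RULES.get(suffix);
            if (removed != null && suffix.endsWith(removed)) {
                return word.substring(0, word.length() - removed.length());
            }
        }
        return word;  // no rule matched
    }

    public static void main(String[] args) {
        System.out.println(stem("gradovima"));
        System.out.println(stem("zxenama"));
    }
}
```

Bounding the loop by the longest rule length keeps the number of map lookups per word small regardless of how long the word is.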