public abstract class KeseljSipkaStemmer extends SerbianStemmer
This abstract class implements the functions shared by the stemmers for Serbian described in the paper:
A Suffix Subsumption-Based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources, Vlado Kešelj, Danko Šipka, Infotheca 9(1-2), 23a-33a (2008).
http://infoteka.bg.ac.rs/pdf/Eng/2008/INFOTHECA_IX_1-2_May2008_23a-33a.pdf
Serbian edition: Pristup izgradnji stemera i lematizora za jezike s bogatom fleksijom i oskudnim resursima zasnovan na obuhvatanju sufiksa, Vlado Kešelj, Danko Šipka, Infoteka 9(1-2), 21-31 (2008).
http://infoteka.bg.ac.rs/pdf/Srp/2008/04%20Vlado-Danko_Stemeri.pdf
(The original implementation in Perl and other resources are available at: http://www.cs.dal.ca/~vlado/nlp/2007-sr/)
These stemmers use the so-called dual1 coding system, in which all Cyrillic letters are transformed into their Latin equivalents and each Latin letter with a diacritical mark (š, đ, č, ć, ž, dž) is written as a pair of Latin letters without diacritical marks. In addition, the letters lj/Lj and nj/Nj are written as ly/Ly and ny/Ny.
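The dual1 idea can be illustrated with a minimal sketch that encodes Latin-script input only. The class `Dual1Sketch` and its `encode` method are hypothetical and are not part of this API. Only the pairs lj/Lj to ly/Ly and nj/Nj to ny/Ny come from the description above; the two-letter codes chosen here for š, đ, č, ć, ž and dž are assumptions made for illustration and may differ from the codes the stemmers actually use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a dual1-style encoding for Latin-script text.
public class Dual1Sketch {
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        // Stated in the class description: lj -> ly, nj -> ny.
        TABLE.put("lj", "ly");
        TABLE.put("Lj", "Ly");
        TABLE.put("nj", "ny");
        TABLE.put("Nj", "Ny");
        // ASSUMED codes below (placeholders, not confirmed by the documentation).
        TABLE.put("d\u017E", "dy"); // d + z-caron digraph
        TABLE.put("D\u017E", "Dy");
        TABLE.put("\u0161", "sx");  // s-caron
        TABLE.put("\u0160", "Sx");
        TABLE.put("\u0111", "dx");  // d-stroke
        TABLE.put("\u0110", "Dx");
        TABLE.put("\u010D", "cx");  // c-caron
        TABLE.put("\u010C", "Cx");
        TABLE.put("\u0107", "cy");  // c-acute
        TABLE.put("\u0106", "Cy");
        TABLE.put("\u017E", "zx");  // z-caron
        TABLE.put("\u017D", "Zx");
    }

    // Scans left to right, trying a two-character match before a one-character match.
    public static String encode(String latin) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < latin.length()) {
            String two = (i + 1 < latin.length()) ? latin.substring(i, i + 2) : null;
            String one = latin.substring(i, i + 1);
            if (two != null && TABLE.containsKey(two)) {
                out.append(TABLE.get(two));
                i += 2;
            } else if (TABLE.containsKey(one)) {
                out.append(TABLE.get(one));
                i += 1;
            } else {
                out.append(one);
                i += 1;
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, `Dual1Sketch.encode("ljubav")` yields `lyubav`. The result is plain ASCII, which is presumably why the rules can be written without any diacritics.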
Modifier and Type | Field and Description |
---|---|
private static long | serialVersionUID |

Inherited fields: maxSuffixLen, rules
Constructor and Description |
---|
KeseljSipkaStemmer() |
Modifier and Type | Method and Description |
---|---|
protected java.lang.String | convertToDual1Character(int intCharacter, char oldChar): Converts a single character from its standard form (Cyrillic or Latin script) to the dual1 coding system |
java.lang.String | stemDual1Line(java.lang.String line): Stems a line of text written in the dual1 coding system |
java.lang.String | stemDual1Word(java.lang.String word): Stems a word written in the dual1 coding system |

Inherited methods: convertToDual1File, convertToDual1String, convertToNormalFile, convertToNormalString, initMaxSuffixLen, initRules, stemDual1File, stemLine, stemWord, getRevision, main, replaceSpaceWithNewLine, stem, stemFile, stemText
private static final long serialVersionUID
public java.lang.String stemDual1Word(java.lang.String word)
Description copied from class SerbianStemmer: stems a word written in the dual1 coding system.
Declared in class SerbianStemmer as stemDual1Word.
Parameters:
word - the word to be stemmed
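A suffix-rule stemmer of this kind can be sketched as a longest-suffix lookup. This is a minimal sketch and not the implementation behind `stemDual1Word`: the class `SuffixRuleSketch`, its `addRule` method, and the sample rules in the usage note are hypothetical. The field names `rules` and `maxSuffixLen` are borrowed from the inherited fields listed above, but how the real class uses them is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical longest-suffix-match stemmer over dual1-encoded words.
public class SuffixRuleSketch {
    // Maps a word-final suffix to its replacement (ASSUMED structure).
    private final Map<String, String> rules = new HashMap<>();
    // Length of the longest suffix in the rule set; bounds the search.
    private int maxSuffixLen = 0;

    public void addRule(String suffix, String replacement) {
        rules.put(suffix, replacement);
        maxSuffixLen = Math.max(maxSuffixLen, suffix.length());
    }

    // Tries the longest candidate suffix first, so a longer rule
    // takes precedence over any shorter rule it subsumes.
    public String stemWord(String word) {
        // Keep at least one character of the word outside the suffix.
        int longest = Math.min(maxSuffixLen, word.length() - 1);
        for (int len = longest; len > 0; len--) {
            String suffix = word.substring(word.length() - len);
            String replacement = rules.get(suffix);
            if (replacement != null) {
                return word.substring(0, word.length() - len) + replacement;
            }
        }
        return word; // no rule applies
    }
}
```

With the made-up rules `addRule("ama", "")` and `addRule("a", "")`, the word `knyigama` is reduced to `knyig`, because the three-letter suffix is tried before the one-letter suffix.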
public java.lang.String stemDual1Line(java.lang.String line)
Description copied from class SerbianStemmer: stems a line of text written in the dual1 coding system.
Declared in class SerbianStemmer as stemDual1Line.
Parameters:
line - the line of text to be processed
protected java.lang.String convertToDual1Character(int intCharacter, char oldChar)
Description copied from class SerbianStemmer: converts a given character from its standard form (Cyrillic or Latin script) to the dual1 coding system.
Declared in class SerbianStemmer as convertToDual1Character.
Parameters:
intCharacter - the Unicode code of the character to be converted to the dual1 system
oldChar - the character that preceded the current character in the text
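One plausible reason for the `oldChar` parameter is that Latin digraphs such as lj and nj can only be recognized by looking at the preceding character. The sketch below shows that look-behind on a character-by-character conversion. The class `Dual1CharSketch` and both of its methods are hypothetical, the code `sx` for š is an assumed placeholder, and only the lj/nj to ly/ny mapping is taken from the class description.

```java
// Hypothetical character-level conversion with one character of look-behind.
public class Dual1CharSketch {

    // Returns the dual1 text for one character, given the character before it.
    static String convertChar(int c, char prev) {
        // A 'j' directly after l/L/n/N is treated as the second half of lj/nj.
        if (c == 'j' && (prev == 'l' || prev == 'L' || prev == 'n' || prev == 'N')) {
            return "y";
        }
        switch (c) {
            case 0x0161: return "sx"; // s-caron, ASSUMED code
            case 0x0160: return "Sx"; // capital s-caron, ASSUMED code
            default:     return new String(Character.toChars(c));
        }
    }

    // Feeds each character to convertChar together with its predecessor.
    static String convert(String text) {
        StringBuilder out = new StringBuilder();
        char prev = '\0'; // no predecessor before the first character
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(convertChar(c, prev));
            prev = c;
        }
        return out.toString();
    }
}
```

In this sketch `convert("polje")` gives `polye`, while a word-initial j, as in `jelen`, is left unchanged because it has no l or n before it.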