public class MilosevicStemmer extends SerbianStemmer
This class implements the stemmer created in the Master's degree thesis of Nikola Milošević, described in the ArXiv paper:
Stemmer for Serbian language, Nikola Milošević, arXiv preprint arXiv:1209.4471 (2012).
http://arxiv.org/abs/1209.4471
This stemmer uses a slightly modified version of the so-called dual1 coding system, in which all Cyrillic letters are converted to their Latin equivalents and every Latin letter that carries a diacritical mark (š, đ, č, ć, ž, dž) is written as a pair of Latin letters without diacritical marks. In contrast to the stemmers of Kešelj and Šipka, the letters lj/Lj and nj/Nj are NOT transformed into the ly/Ly and ny/Ny forms here.
Note: the algorithm implementation differs in certain aspects from the one in the ArXiv document; see the code for details.
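The letter-to-pair encoding described above can be illustrated with a minimal sketch. The class name `Dual1Sketch`, the method `encode`, and the concrete replacement pairs (`sx`, `cx`, `cy`, `zx`, `dx`) are assumptions made for illustration only; the mapping actually used by this class lives in its source code. The sketch covers only the single Latin letters with diacritics, omits Cyrillic input and `dž`, and leaves `lj` and `nj` untouched.

```java
import java.util.Map;

class Dual1Sketch {
    // Assumed, illustrative mapping (not taken from the MilosevicStemmer source).
    // Unicode escapes keep the file independent of the source encoding.
    private static final Map<Character, String> DIACRITICS = Map.of(
        '\u0161', "sx", '\u0160', "Sx",   // s with caron
        '\u010D', "cx", '\u010C', "Cx",   // c with caron
        '\u0107', "cy", '\u0106', "Cy",   // c with acute
        '\u017E', "zx", '\u017D', "Zx",   // z with caron
        '\u0111', "dx", '\u0110', "Dx");  // d with stroke

    // Replaces each mapped letter with its two-letter form; every other
    // character, including the letters of "lj" and "nj", is copied unchanged.
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = DIACRITICS.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "ljulja\u0161ka" contains "lj" twice and one s with caron.
        System.out.println(encode("ljulja\u0161ka")); // prints ljuljasxka
    }
}
```

Because `lj` and `nj` pass through unchanged, text encoded this way is not interchangeable with input prepared for a stemmer that expects the `ly`/`ny` forms.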
Modifier and Type | Field and Description |
---|---|
private java.util.HashMap<java.lang.String,java.lang.String> | dictionary: A dictionary used to normalize frequently used irregular verbs |
private static long | serialVersionUID |
maxSuffixLen, rules
Constructor and Description |
---|
MilosevicStemmer() |
Modifier and Type | Method and Description |
---|---|
protected java.lang.String | convertToDual1Character(int intCharacter, char oldChar): Converts a single character from the standard form (Cyrillic or Latin script) to the dual1 coding system. |
protected void | initRules(): Milošević: Currently 285 rules |
java.lang.String | stemDual1Line(java.lang.String line): Stems a line of text written in the dual1 coding system. |
java.lang.String | stemDual1Word(java.lang.String word): In this implementation, Milošević's original algorithm is somewhat altered so that it more closely resembles the stemmers of Kešelj and Šipka. |
convertToDual1File, convertToDual1String, convertToNormalFile, convertToNormalString, initMaxSuffixLen, stemDual1File, stemLine, stemWord
getRevision, main, replaceSpaceWithNewLine, stem, stemFile, stemText
private static final long serialVersionUID
private java.util.HashMap<java.lang.String,java.lang.String> dictionary
A dictionary which is used to normalize the frequently used irregular verbs
public java.lang.String stemDual1Word(java.lang.String word)
In this implementation, Milošević's original algorithm is somewhat altered so that it more closely resembles the stemmers of Kešelj and Šipka. The reason for this is that two issues were detected:
stemDual1Word in class SerbianStemmer
word - The word to be stemmed
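As a rough illustration of how an irregular-verb dictionary and a table of suffix rules can work together, the sketch below first looks the word up in the dictionary and, failing that, tries the longest matching suffix. This is a hypothetical simplification, not the algorithm implemented by this class: the class `SuffixStemSketch`, the `MIN_STEM_LEN` threshold, the type of `rules`, and all dictionary entries and rules are toy assumptions. Only the field names `dictionary`, `rules`, and `maxSuffixLen` echo the ones listed on this page.

```java
import java.util.HashMap;
import java.util.Map;

class SuffixStemSketch {
    // Assumed threshold: never strip a word down to fewer than 2 characters.
    private static final int MIN_STEM_LEN = 2;

    private final Map<String, String> dictionary = new HashMap<>();
    private final Map<String, String> rules = new HashMap<>();
    private int maxSuffixLen = 0;

    SuffixStemSketch() {
        // Toy entries, invented for this example only.
        dictionary.put("bih", "biti");
        dictionary.put("bismo", "biti");
        addRule("ovima", "");
        addRule("ama", "");
        addRule("a", "");
    }

    // Registers a suffix rule and keeps track of the longest suffix seen.
    private void addRule(String suffix, String replacement) {
        rules.put(suffix, replacement);
        maxSuffixLen = Math.max(maxSuffixLen, suffix.length());
    }

    String stem(String word) {
        // 1. Irregular forms are normalized by direct dictionary lookup.
        String normalized = dictionary.get(word);
        if (normalized != null) {
            return normalized;
        }
        // 2. Otherwise try suffixes from longest to shortest, keeping at
        //    least MIN_STEM_LEN characters of the word.
        int longest = Math.min(maxSuffixLen, word.length() - MIN_STEM_LEN);
        for (int len = longest; len > 0; len--) {
            int cut = word.length() - len;
            String replacement = rules.get(word.substring(cut));
            if (replacement != null) {
                return word.substring(0, cut) + replacement;
            }
        }
        // 3. No rule matched: return the word unchanged.
        return word;
    }

    public static void main(String[] args) {
        SuffixStemSketch stemmer = new SuffixStemSketch();
        System.out.println(stemmer.stem("gradovima")); // prints grad
        System.out.println(stemmer.stem("bih"));       // prints biti
    }
}
```

Trying longer suffixes first means that `gradovima` loses `ovima` rather than only the final `a`; caching the maximum suffix length bounds the number of lookups per word.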
public java.lang.String stemDual1Line(java.lang.String line)
SerbianStemmer
Stems a line of text written in the dual1 coding system
stemDual1Line in class SerbianStemmer
line - The line of text to be processed
protected java.lang.String convertToDual1Character(int intCharacter, char oldChar)
Converts a given character from the standard form (in the Cyrillic or Latin script) to the dual1 coding system. Milošević's stemmer does not implement the full dual1 coding system: it dispenses with the conversion of 'lj' and 'nj' into 'ly' and 'ny', making this function slightly different from the one used for the stemmers of Kešelj and Šipka.
convertToDual1Character in class SerbianStemmer
intCharacter - Unicode code point of the character to be converted into the dual1 system
oldChar - The character that preceded the current character in the text
protected void initRules()
initRules in class SerbianStemmer