public abstract class SerbianStemmer extends SCStemmer
This abstract class implements the functions shared by the stemmers for Serbian described in the paper:
A Suffix Subsumption-Based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources, Vlado Kešelj, Danko Šipka, Infotheca 9(1-2), 23a-33a (2008).
http://infoteka.bg.ac.rs/pdf/Eng/2008/INFOTHECA_IX_1-2_May2008_23a-33a.pdf
(Serbian version: Infoteka 9(1-2), 21-31 (2008), http://infoteka.bg.ac.rs/pdf/Srp/2008/04%20Vlado-Danko_Stemeri.pdf)
(the original Perl implementation and other resources are available at: http://www.cs.dal.ca/~vlado/nlp/2007-sr/)
and by the stemmer created in Nikola Milošević's Master's thesis, described in the arXiv paper:
Stemmer for Serbian language, Nikola Milošević, arXiv preprint arXiv:1209.4471 (2012).
http://arxiv.org/abs/1209.4471
All stemmers for Serbian use the so-called dual1 coding system, in which all Cyrillic letters are converted into their Latin equivalents and every Latin letter that carries a diacritical mark (š, đ, č, ć, ž, and the digraph dž) is written as a pair of Latin letters without diacritical marks. In addition, some stemmers convert the letters lj/Lj and nj/Nj into the forms ly/Ly and ny/Ny.
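The following is a minimal sketch of the idea behind dual1, not the code of this class. The letter pairs used here (sx, dx, cx, cy, zx, dy) are an assumption made for illustration, and the class name `Dual1Sketch` and its methods are hypothetical. Cyrillic transliteration and upper-case letters are left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class Dual1Sketch {
    // ASSUMED mapping, for illustration only; the real tables live in the stemmer classes.
    // Insertion order matters: the digraph must be handled before its second letter.
    private static final Map<String, String> TO_DUAL1 = new LinkedHashMap<>();
    static {
        TO_DUAL1.put("d\u017e", "dy"); // d + z-caron (digraph)
        TO_DUAL1.put("\u0161", "sx");  // s-caron
        TO_DUAL1.put("\u0111", "dx");  // d-stroke
        TO_DUAL1.put("\u010d", "cx");  // c-caron
        TO_DUAL1.put("\u0107", "cy");  // c-acute
        TO_DUAL1.put("\u017e", "zx");  // z-caron
    }

    // Replaces every lower-case letter with a diacritic by its two-letter code.
    static String toDual1(String text) {
        String result = text;
        for (Map.Entry<String, String> e : TO_DUAL1.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    // Reverses the substitution; unambiguous as long as x and y occur only inside codes.
    static String toNormal(String text) {
        String result = text;
        for (Map.Entry<String, String> e : TO_DUAL1.entrySet()) {
            result = result.replace(e.getValue(), e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        String encoded = toDual1("\u010da\u0161a");
        System.out.println(encoded);
        System.out.println(toNormal(encoded).equals("\u010da\u0161a"));
    }
}
```

Because x and y are not letters of the Serbian alphabet, two-letter codes built on them can be reversed without ambiguity in native Serbian words, which is what makes a round trip through `toNormal` possible in this sketch.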
Modifier and Type | Field and Description |
---|---|
protected int | maxSuffixLen: Length (in characters) of the longest suffix rule |
protected java.util.HashMap<java.lang.String,java.lang.String> | rules: The list of suffix rules |
private static long | serialVersionUID |

Constructor and Description |
---|
SerbianStemmer() |

Modifier and Type | Method and Description |
---|---|
protected abstract java.lang.String | convertToDual1Character(int intCharacter, char oldChar): Converts a single character from the standard form (Cyrillic or Latin script) to the dual1 coding system |
void | convertToDual1File(java.lang.String fileInput, java.lang.String fileOutput): Converts the contents of the given input file from the standard form (Cyrillic or Latin script) into the dual1 coding system and writes them to the given output file |
java.lang.String | convertToDual1String(java.lang.String wordOrLine): Converts the given string (a word or a line of text) from the standard form (Cyrillic or Latin script) into the dual1 coding system |
void | convertToNormalFile(java.lang.String fileInput, java.lang.String fileOutput): Converts the contents of the given input file from the dual1 coding system into the standard Latin script form and writes them to the given output file |
java.lang.String | convertToNormalString(java.lang.String wordOrLine): Converts the given string (a word or a line of text) from the dual1 coding system into the standard Latin script form |
protected void | initMaxSuffixLen(): Finds the maximum suffix length in the suffix rules |
protected void | initRules(): Allocates the memory for the list of suffix rules |
void | stemDual1File(java.lang.String fileInput, java.lang.String fileOutput): Stems the contents of the input file written in the dual1 coding system and writes them to the output file |
abstract java.lang.String | stemDual1Line(java.lang.String line): Stems a line of text written in the dual1 coding system |
abstract java.lang.String | stemDual1Word(java.lang.String word): Stems a word written in the dual1 coding system |
java.lang.String | stemLine(java.lang.String line): Stems a line of text written in the standard form (Cyrillic or Latin script) |
java.lang.String | stemWord(java.lang.String word): Stems a word written in the standard form (Cyrillic or Latin script) |

Methods inherited from class SCStemmer: getRevision, main, replaceSpaceWithNewLine, stem, stemFile, stemText
private static final long serialVersionUID
protected int maxSuffixLen
Length (in characters) of the longest suffix rule
protected java.util.HashMap<java.lang.String,java.lang.String> rules
The list of suffix rules
protected void initRules()
Allocates the memory for the list of suffix rules
protected void initMaxSuffixLen()
Finds the maximal suffix length in the suffix rules
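To illustrate how a rule map and a cached maximum suffix length can work together, here is a small sketch of longest-suffix matching. It is an assumption about one plausible use of these two fields, not the algorithm of any concrete stemmer in this package; the class `SuffixRuleSketch`, the sample rules, and the reading of map values as replacements are all hypothetical.

```java
import java.util.HashMap;

final class SuffixRuleSketch {
    // Hypothetical rules: suffix -> replacement (empty string means "strip the suffix").
    private final HashMap<String, String> rules = new HashMap<>();
    private int maxSuffixLen = 0;

    SuffixRuleSketch() {
        rules.put("ovima", "");
        rules.put("ama", "");
        rules.put("a", "");
        initMaxSuffixLen();
    }

    // Scans the rule keys once so that lookups never try suffixes longer than any rule.
    private void initMaxSuffixLen() {
        for (String suffix : rules.keySet()) {
            maxSuffixLen = Math.max(maxSuffixLen, suffix.length());
        }
    }

    // Tries the longest candidate suffix first and always keeps at least one character.
    String stem(String word) {
        int longest = Math.min(maxSuffixLen, word.length() - 1);
        for (int len = longest; len > 0; len--) {
            String suffix = word.substring(word.length() - len);
            String replacement = rules.get(suffix);
            if (replacement != null) {
                return word.substring(0, word.length() - len) + replacement;
            }
        }
        return word; // no rule matched
    }

    public static void main(String[] args) {
        SuffixRuleSketch sketch = new SuffixRuleSketch();
        System.out.println(sketch.stem("gradovima"));
        System.out.println(sketch.stem("kucxa"));
    }
}
```

Caching the maximum length bounds the number of hash lookups per word by the length of the longest rule rather than by the length of the word.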
public java.lang.String stemWord(java.lang.String word)
Stems a word written in the standard form (in the Cyrillic or Latin script)
public java.lang.String stemLine(java.lang.String line)
Stems a line of text written in the standard form (in the Cyrillic or Latin script)
public abstract java.lang.String stemDual1Word(java.lang.String word)
Stems a word written in the dual1 coding system
word - The word to be stemmed
public abstract java.lang.String stemDual1Line(java.lang.String line)
Stems a line of text written in the dual1 coding system
line - The line of text to be processed
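Since `stemDual1Line` is abstract, each concrete stemmer decides how a line is split into words. The sketch below shows one simple possibility, assuming whitespace tokenization and a word-level stemmer passed in as a function. The class `LineStemSketch` and the toy stemmer in `main` are hypothetical, and punctuation handling is deliberately ignored.

```java
import java.util.function.UnaryOperator;

final class LineStemSketch {
    // Splits the line on whitespace, stems each token, and rejoins with single spaces.
    static String stemLine(String line, UnaryOperator<String> wordStemmer) {
        StringBuilder out = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty or all-whitespace line
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(wordStemmer.apply(token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Toy word stemmer (an assumption): strips one trailing "a".
        UnaryOperator<String> toy =
                w -> w.endsWith("a") ? w.substring(0, w.length() - 1) : w;
        System.out.println(stemLine("  kucxa   i zxena ", toy));
    }
}
```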
public void stemDual1File(java.lang.String fileInput, java.lang.String fileOutput)
Stems the contents of the input file written in the dual1 coding system and writes them to the output file
fileInput - The name of the input file
fileOutput - The name of the output file
public java.lang.String convertToNormalString(java.lang.String wordOrLine)
Converts the given string (a word or a line of text) from the dual1 coding system into the standard Latin script form
wordOrLine - A string written in the dual1 coding system
public java.lang.String convertToDual1String(java.lang.String wordOrLine)
Converts the given string (a word or a line of text) from the standard form (in the Cyrillic or Latin script) into the dual1 coding system
wordOrLine - A string written in the standard form
public void convertToNormalFile(java.lang.String fileInput, java.lang.String fileOutput)
Converts the contents of a given input file from the dual1 coding system into the standard Latin script form and writes them into the given output file
fileInput - The input file whose contents are written in the dual1 coding system
fileOutput - The output file whose contents are to be written in the standard form
public void convertToDual1File(java.lang.String fileInput, java.lang.String fileOutput)
Converts the contents of a given input file from the standard form (in the Cyrillic or Latin script) into the dual1 coding system and writes them into the given output file
fileInput - The input file whose contents are written in the standard form
fileOutput - The output file whose contents are to be written in the dual1 coding system
protected abstract java.lang.String convertToDual1Character(int intCharacter, char oldChar)
Converts a given character from the standard form (in the Cyrillic or Latin script) to the dual1 coding system
intCharacter - The Unicode code of the character to be converted into the dual1 system
oldChar - The character that preceded the current character in the text
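A previous-character parameter is useful when the output for a letter depends on its left neighbour, as with the digraphs lj, nj and dž. The sketch below illustrates that idea only. The class `Dual1CharSketch`, the letter codes, and the context rules are assumptions and do not reproduce any concrete subclass; only a handful of lower-case Latin letters are handled.

```java
final class Dual1CharSketch {
    // Converts one code point, using the previous character to resolve digraphs.
    static String convertChar(int codePoint, char previous) {
        switch (codePoint) {
            case 0x0161: // s-caron
                return "sx";
            case 0x017E: // z-caron: after 'd' it completes the digraph, giving "d" + "y"
                return previous == 'd' ? "y" : "zx";
            case 'j':    // after 'l' or 'n' it completes lj/nj, giving ly/ny
                return (previous == 'l' || previous == 'n') ? "y" : "j";
            default:     // everything else passes through unchanged
                return new String(Character.toChars(codePoint));
        }
    }

    // Walks the text one character at a time, remembering the previous input character.
    static String convert(String text) {
        StringBuilder out = new StringBuilder();
        char previous = '\0';
        for (int i = 0; i < text.length(); i++) {
            char current = text.charAt(i);
            out.append(convertChar(current, previous));
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("ljubav"));
        System.out.println(convert("konj"));
    }
}
```

A purely contextual rule like this one also rewrites letter sequences that only look like digraphs (for example n followed by j across a morpheme boundary), so a production implementation may need extra care there.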