public class LjubesicPandzicStemmer extends SCStemmer
This class implements the "Simple stemmer for Croatian v0.1" by Nikola Ljubešić and Ivan Pandžić. The original implementation in Python is available at:
http://nlp.ffzg.hr/resources/tools/stemmer-for-croatian/
The stemmer improves on an earlier algorithm described in the paper:
Retrieving Information in Croatian: Building a Simple and Efficient Rule-Based Stemmer, Nikola Ljubešić, Damir Boras, Ozren Kubelka, Digital Information and Heritage, 313–320 (2007).
Modifier and Type | Field and Description |
---|---|
private static long | serialVersionUID |
private java.util.HashSet<java.lang.String> | stopset: The list of stop-words. |
private java.util.HashMap<java.lang.String,java.lang.String> | transformations: The map of suffix transformations. |
private static java.util.regex.Pattern | vowelPattern: The set of vowels. |
private java.util.ArrayList<java.lang.String> | wordEnd: The list of word endings. |
private java.util.ArrayList<java.util.regex.Pattern> | wordPatterns: The list of morphological patterns of words. |
private java.util.ArrayList<java.lang.String> | wordStart: The list of word beginnings. |
Constructor and Description |
---|
LjubesicPandzicStemmer() |
Modifier and Type | Method and Description |
---|---|
private java.lang.String | capitalizeSyllabicR(java.lang.String word): Capitalizes the syllabic R in the given word, if it exists. |
private boolean | hasAVowel(java.lang.String word): Checks whether the word contains a vowel or a syllabic R. |
protected void | initRules(): Initializes the stemming rules. |
java.lang.String | stemLine(java.lang.String line): Stems a line of text. |
java.lang.String | stemWord(java.lang.String word): If a stop-word is encountered, it is skipped. |
private java.lang.String | transform(java.lang.String word): Replaces the word suffix with a transformed variant of that suffix. |
Methods inherited from class SCStemmer: getRevision, main, replaceSpaceWithNewLine, stem, stemFile, stemText
private static final long serialVersionUID
private java.util.HashMap<java.lang.String,java.lang.String> transformations
The map of suffix transformations.
private java.util.HashSet<java.lang.String> stopset
The list of stop-words. A HashSet is used so that membership checks are efficient.
private java.util.ArrayList<java.lang.String> wordStart
The list of word beginnings.
private java.util.ArrayList<java.lang.String> wordEnd
The list of word endings.
private java.util.ArrayList<java.util.regex.Pattern> wordPatterns
The list of morphological patterns of words.
private static final java.util.regex.Pattern vowelPattern
The set of vowels.
public java.lang.String stemWord(java.lang.String word)
If a stop-word is encountered, it is skipped. Otherwise, the suffix of the word is first transformed and then removed.
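The order of operations described above (stop-word check, suffix transformation, suffix removal) can be sketched as follows. This is a minimal illustration, not the class's actual code: the class name `StemWordSketch`, the sample stop-word, the single transformation entry, and the single word pattern are all assumptions made for the example, and "skipped" is assumed to mean that the stop-word is returned unchanged. The real rules are set up by `initRules()`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StemWordSketch {
    // Illustrative sample data only (assumption); the real rules come from initRules().
    static final Set<String> STOPSET = new HashSet<>();
    static final Map<String, String> TRANSFORMATIONS = new HashMap<>();
    // One hypothetical word pattern: group 1 is the stem, group 2 the suffix to drop.
    static final Pattern WORD_PATTERN = Pattern.compile("^(.+?)(ama|a)$");

    static {
        STOPSET.add("biti");
        TRANSFORMATIONS.put("lozi", "loga");
    }

    // Replaces a matching suffix with its transformed variant, if any rule applies.
    static String transform(String word) {
        for (Map.Entry<String, String> rule : TRANSFORMATIONS.entrySet()) {
            String suffix = rule.getKey();
            if (word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length()) + rule.getValue();
            }
        }
        return word;
    }

    static String stemWord(String word) {
        String lower = word.toLowerCase();
        if (STOPSET.contains(lower)) {
            return lower; // stop-words are left as they are (assumed meaning of "skipped")
        }
        String transformed = transform(lower);            // step 1: transform the suffix
        Matcher m = WORD_PATTERN.matcher(transformed);    // step 2: remove the suffix
        return m.matches() ? m.group(1) : transformed;
    }

    public static void main(String[] args) {
        System.out.println(stemWord("biti"));     // biti
        System.out.println(stemWord("knjigama")); // knjig
        System.out.println(stemWord("biolozi"));  // biolog
    }
}
```

With several transformation rules, a `HashMap` does not guarantee iteration order, so a sketch like this only behaves predictably when at most one suffix rule can match a given word.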
public java.lang.String stemLine(java.lang.String line)
Stems a line of text
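One plausible way to stem a line is to split it into tokens and stem each token in turn. The sketch below assumes whitespace tokenization and a single-space join, neither of which is confirmed by this page; `StemLineSketch` and the stand-in word stemmer passed as a parameter are hypothetical.

```java
import java.util.function.UnaryOperator;

public class StemLineSketch {
    // Applies the given word stemmer to every whitespace-separated token of the line.
    static String stemLine(String line, UnaryOperator<String> stemWord) {
        StringBuilder out = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // a blank line yields a single empty token
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stemWord.apply(token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in word stemmer (hypothetical): drops a final "a".
        UnaryOperator<String> dropFinalA =
                w -> w.endsWith("a") ? w.substring(0, w.length() - 1) : w;
        System.out.println(stemLine("knjiga  i olovka", dropFinalA)); // knjig i olovk
    }
}
```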
private java.lang.String transform(java.lang.String word)
Replaces the word suffix with a transformed variant of that suffix
word - The word to process
private java.lang.String capitalizeSyllabicR(java.lang.String word)
Capitalizes the syllabic R in the given word, if it exists
word - The word to process
private boolean hasAVowel(java.lang.String word)
Checks whether the word contains a vowel/syllabic R
word - The word to process
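The syllabic R handling behind `capitalizeSyllabicR` and `hasAVowel` can be illustrated with two regular expressions. This is a sketch under a stated assumption: an "r" is treated as syllabic when it has no vowel directly before or after it, and the vowel check then accepts either a regular vowel or the capitalized "R". The class name `SyllabicRSketch` and the exact patterns are hypothetical and may differ from the private `vowelPattern` field.

```java
import java.util.regex.Pattern;

public class SyllabicRSketch {
    // Assumed rule: "r" with no vowel on either side (or at a word boundary) is syllabic.
    static final Pattern SYLLABIC_R = Pattern.compile("(^|[^aeiou])r($|[^aeiou])");
    // Vowels plus the capitalized syllabic R marker.
    static final Pattern VOWEL = Pattern.compile("[aeiouR]");

    // Marks each syllabic "r" by upper-casing it; other characters are kept as they are.
    static String capitalizeSyllabicR(String word) {
        return SYLLABIC_R.matcher(word).replaceAll("$1R$2");
    }

    // True if the word contains a vowel or a syllabic R.
    static boolean hasAVowel(String word) {
        return VOWEL.matcher(capitalizeSyllabicR(word)).find();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeSyllabicR("prst")); // pRst
        System.out.println(capitalizeSyllabicR("more")); // more
        System.out.println(hasAVowel("prst"));           // true
        System.out.println(hasAVowel("st"));             // false
    }
}
```

A check like this lets a stemmer reject candidate stems such as "st" that could not form a syllable, while still accepting vowel-less Croatian words such as "prst" or "trg".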