Paraphrase

Part-of-speech tag-supported short-text semantic similarity (POST STSS)

POST STSS is a method for computing short-text semantic similarity (also called semantic textual similarity). It uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to their parts of speech, and the optimal POS weights are determined using an incremental hill-climbing technique. The only language-specific resource POST STSS requires is a part-of-speech tagger (and, optionally, a lemmatizer), which makes it applicable to most languages. Further information about the algorithm can be found in the 2015 ComSIS paper. POST STSS is implemented within the STSFineGrain package.
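
The general idea of POS-weighted word matching with hill-climbing weight tuning can be illustrated with a minimal Python sketch. This is not the STSFineGrain implementation: the function names, the character-overlap word similarity (a stand-in for the combination of string overlap and distributional similarity), the greedy same-tag pairing, and the "separation" objective are all assumptions made for illustration only. The 2015 ComSIS paper remains the reference for the actual algorithm.

```python
from difflib import SequenceMatcher


def word_sim(a, b):
    """Stand-in word similarity (assumption): character overlap ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pos_weighted_similarity(sent_a, sent_b, pos_weights, default_weight=1.0):
    """Similarity of two sentences given as lists of (token, pos_tag) pairs.

    Greedily pairs the most similar tokens that share a POS tag, weights each
    pair by its tag's weight, and normalises by the total weight of all tokens.
    """
    candidates = sorted(
        (
            (word_sim(wa, wb), i, j)
            for i, (wa, ta) in enumerate(sent_a)
            for j, (wb, tb) in enumerate(sent_b)
            if ta == tb
        ),
        reverse=True,
    )
    used_a, used_b = set(), set()
    matched = 0.0
    for sim, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matched += pos_weights.get(sent_a[i][1], default_weight) * sim
    total = sum(pos_weights.get(tag, default_weight) for _, tag in sent_a + sent_b)
    return 2.0 * matched / total if total else 0.0


def separation(pairs, labels, weights):
    """Hypothetical objective: mean score of paraphrases minus mean score of non-paraphrases."""
    pos = [pos_weighted_similarity(a, b, weights) for (a, b), y in zip(pairs, labels) if y]
    neg = [pos_weighted_similarity(a, b, weights) for (a, b), y in zip(pairs, labels) if not y]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def hill_climb(pairs, labels, tags, step=0.25, max_rounds=20):
    """Adjust one POS weight at a time, keeping a change only if the objective improves."""
    weights = {tag: 1.0 for tag in tags}
    best = separation(pairs, labels, weights)
    for _ in range(max_rounds):
        improved = False
        for tag in tags:
            for delta in (step, -step):
                trial = dict(weights)
                trial[tag] = max(0.0, trial[tag] + delta)
                score = separation(pairs, labels, trial)
                if score > best + 1e-12:
                    weights, best, improved = trial, score, True
        if not improved:
            break  # local optimum reached
    return weights, best


if __name__ == "__main__":
    a = [("cat", "NOUN"), ("sleeps", "VERB")]
    d = [("dog", "NOUN"), ("sleeps", "VERB")]
    tuned, score = hill_climb([(a, a), (a, d)], [1, 0], ["NOUN", "VERB"])
    print(tuned, score)
```

In this toy setting the training data must contain at least one paraphrase pair and one non-paraphrase pair, since the objective averages over both groups.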

Language-independent Short-Text Semantic Similarity (LInSTSS)

LInSTSS is a method for computing short-text semantic similarity (also called semantic textual similarity). It uses a bag-of-words approach and relies on string overlap measures and lexical distributional semantics. Similarities between individual words are weighted according to word frequencies. Since it does not use any language-specific tools or resources, LInSTSS is easily applicable to any language. Further information about the algorithm can be found in the 2013 Decision Support Systems paper. LInSTSS is implemented within the STSFineGrain package.
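
A rough Python sketch of frequency-based weighting is given below. It is an illustration under stated assumptions, not the LInSTSS implementation: the logarithmic weighting formula, the character-overlap word similarity, the greedy pairing, and all function names are hypothetical choices. The only point it is meant to show is that matches between rare words can be made to count for more than matches between frequent ones without any language-specific tools.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def word_sim(a, b):
    """Stand-in word similarity (assumption): character overlap ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_weights(corpus_tokens):
    """Derive per-word weights from raw corpus counts.

    Hypothetical formula: weight = log(total / count) + 1, so rarer words
    receive larger weights. Unseen words get the weight of a count-1 word.
    """
    counts = Counter(tok.lower() for tok in corpus_tokens)
    total = sum(counts.values())
    weights = {word: math.log(total / count) + 1.0 for word, count in counts.items()}
    default_weight = math.log(total) + 1.0
    return weights, default_weight


def freq_weighted_similarity(tokens_a, tokens_b, weights, default_weight):
    """Greedy bag-of-words matching in which each matched pair is weighted
    by the mean of the two words' frequency-based weights."""

    def weight(tok):
        return weights.get(tok.lower(), default_weight)

    candidates = sorted(
        (
            (word_sim(a, b), i, j)
            for i, a in enumerate(tokens_a)
            for j, b in enumerate(tokens_b)
        ),
        reverse=True,
    )
    used_a, used_b = set(), set()
    matched = 0.0
    for sim, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matched += sim * 0.5 * (weight(tokens_a[i]) + weight(tokens_b[j]))
    total = sum(weight(t) for t in tokens_a) + sum(weight(t) for t in tokens_b)
    return 2.0 * matched / total if total else 0.0


if __name__ == "__main__":
    corpus = "the the the the government announced the budget".split()
    weights, default_weight = build_weights(corpus)
    print(freq_weighted_similarity(
        ["the", "budget"], ["the", "government"], weights, default_weight
    ))
```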

The Serbian Paraphrase Corpus (paraphrase.sr)

The Serbian Paraphrase Corpus – paraphrase.sr (ISLRN 192-200-046-033-9) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The construction of this corpus is described in the 2011 TELFOR paper and the 2013 Decision Support Systems paper.
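
Because the annotations are binary, a similarity method that outputs continuous scores is typically compared against them by applying a decision threshold. The Python sketch below shows one such evaluation. It makes no assumption about the corpus file format: it takes plain lists of scores and 0/1 labels, and the function name and the default threshold of 0.5 are hypothetical.

```python
def evaluate(scores, labels, threshold=0.5):
    """Compare thresholded similarity scores against binary paraphrase labels.

    scores: floats produced by some similarity method
    labels: 1 (paraphrase) or 0 (not a paraphrase), one per sentence pair
    Returns accuracy, precision, recall and F1 as a dict.
    """
    preds = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    tn = sum(1 for p, y in zip(preds, labels) if not p and not y)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy scores and labels, not taken from the corpus.
    print(evaluate([0.9, 0.8, 0.3, 0.6], [1, 1, 0, 0]))
```

In practice the threshold would be chosen on held-out data rather than fixed in advance.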