STSFineGrain - a collection of STS models and a framework for their evaluation on STS corpora with fine-grained similarity scores

This package contains Java implementations of three baseline unsupervised STS models and four bag-of-words supervised STS models. STSFineGrain includes the following models:

Unsupervised models

  1. Word overlap
  2. Mean of word2vec word vectors
  3. Mixture of models 1 and 2
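
As a rough illustration of what these three baselines compute, here is a minimal Java sketch. It is not the package's own code: the whitespace tokenizer, the use of the Jaccard coefficient for word overlap, the cosine comparison of mean vectors, the linear interpolation weight alpha, and all class and method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BaselineSketch {

    // Naive tokenizer (assumption): lowercase, split on whitespace, drop empty tokens.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Word overlap, here taken to be the Jaccard coefficient of the two token sets.
    static double wordOverlap(String a, String b) {
        Set<String> ta = new HashSet<>(tokenize(a));
        Set<String> tb = new HashSet<>(tokenize(b));
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    // Mean of the vectors of all tokens present in the embedding map; unknown tokens are skipped.
    static double[] meanVector(String text, Map<String, double[]> vectors, int dim) {
        double[] mean = new double[dim];
        int found = 0;
        for (String t : tokenize(text)) {
            double[] v = vectors.get(t);
            if (v == null) {
                continue;
            }
            for (int i = 0; i < dim; i++) {
                mean[i] += v[i];
            }
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < dim; i++) {
                mean[i] /= found;
            }
        }
        return mean;
    }

    // Cosine similarity; returns 0 when either vector has zero length.
    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return (nx == 0 || ny == 0) ? 0.0 : dot / Math.sqrt(nx * ny);
    }

    // Mixture as a linear interpolation of the two scores (assumed form), with alpha in [0, 1].
    static double mixture(String a, String b, Map<String, double[]> vectors, int dim, double alpha) {
        double overlap = wordOverlap(a, b);
        double embedding = cosine(meanVector(a, vectors, dim), meanVector(b, vectors, dim));
        return alpha * overlap + (1 - alpha) * embedding;
    }
}
```

In the supervised setting described later, raw scores like these would still need to be mapped onto the gold standard scale, which this sketch does not attempt.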

Supervised models

  1. Islam and Inkpen
  2. LInSTSS
  3. POST STSS
  4. POS-TF STSS

Please see the References section for papers describing all of the aforementioned models. Note that POST STSS and POS-TF STSS rely on a language-specific POS weighting scheme. The STSFineGrain package currently supports applying these models to texts in Serbian and English. Other implemented models do not have such language-related restrictions.

All models expect the input text to be encoded in UTF-8. The term frequency calculation output is also encoded in UTF-8, while model evaluation outputs are ANSI-encoded.

Fine-grained gold standard similarity scores are required in the evaluation of all models and the training of supervised ones.
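
To make the role of the gold standard scores in evaluation concrete, the sketch below compares model predictions with gold scores using the Pearson correlation coefficient, a metric commonly used for STS. Whether the framework reports exactly this metric is an assumption here, and the class and method names are hypothetical.

```java
public class PearsonSketch {

    // Pearson correlation between predicted and gold similarity scores.
    // Returns NaN if either array is constant, since the variance is then zero.
    static double pearson(double[] predicted, double[] gold) {
        if (predicted.length != gold.length || predicted.length < 2) {
            throw new IllegalArgumentException("Need two arrays of equal length >= 2");
        }
        int n = predicted.length;
        double meanP = 0, meanG = 0;
        for (int i = 0; i < n; i++) {
            meanP += predicted[i];
            meanG += gold[i];
        }
        meanP /= n;
        meanG /= n;

        double cov = 0, varP = 0, varG = 0;
        for (int i = 0; i < n; i++) {
            double dp = predicted[i] - meanP;
            double dg = gold[i] - meanG;
            cov += dp * dg;
            varP += dp * dp;
            varG += dg * dg;
        }
        return cov / Math.sqrt(varP * varG);
    }

    public static void main(String[] args) {
        // Hypothetical scores for four sentence pairs.
        double[] predicted = {0.9, 2.1, 3.2, 4.8};
        double[] gold = {1.0, 2.0, 3.5, 5.0};
        System.out.println(pearson(predicted, gold));
    }
}
```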

Command-line interface

The supplied STSFineGrain.jar file makes it possible to use the STSFineGrain framework from the command line. The framework is invoked using the following general command form:

java -jar STSFineGrain.jar ActionID ActionSpecificArguments

ActionID selects the action to perform: 0 for term frequency calculation or 1 for STS model evaluation.

Term frequency calculation

If ActionID is 0, the command should have the following form:

java -jar STSFineGrain.jar 0 InputCorporaPaths OutputTFPath
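
For readers unfamiliar with the term, the sketch below shows conceptually what a term frequency calculation over UTF-8 text does: it counts how often each token occurs. It is only an illustration. The whitespace tokenization, lowercasing, and all names are assumptions, and it does not reproduce the package's actual tokenization or output file format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class TermFrequencySketch {

    // Count occurrences of each lowercased, whitespace-delimited token.
    static Map<String, Integer> countTerms(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Read a corpus file as UTF-8, matching the input encoding the models expect.
    static Map<String, Integer> countTermsInFile(Path corpus) throws IOException {
        return countTerms(Files.readAllLines(corpus, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Hypothetical two-line corpus.
        Map<String, Integer> tf = countTerms(Arrays.asList("the dog runs", "The dog sleeps"));
        tf.forEach((term, count) -> System.out.println(term + "\t" + count));
    }
}
```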

STS model evaluation

If ActionID is 1, the command should have the following form:

java -jar STSFineGrain.jar 1 STSModelIndexNo EvaluationModeIndexNo LanguageCode STSCorpusRawTextsPath STSCorpusScoresPath Word2VecVectorsPath TermFrequenciesPath STSCorpusMSDorPOSPath
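
Because this command takes nine positional arguments, it can help to assemble it programmatically. The following sketch builds the argument list in the order shown above and can launch it with the standard ProcessBuilder API. All argument values in main (the index numbers, the language code, and every path) are hypothetical placeholders, and the class and method names are not part of the package.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EvaluationCommandSketch {

    // Assemble the ActionID 1 command in the positional order given in this README.
    static List<String> buildCommand(String jarPath,
                                     String stsModelIndexNo,
                                     String evaluationModeIndexNo,
                                     String languageCode,
                                     String rawTextsPath,
                                     String scoresPath,
                                     String word2VecVectorsPath,
                                     String termFrequenciesPath,
                                     String msdOrPosPath) {
        List<String> command = new ArrayList<>(Arrays.asList("java", "-jar", jarPath, "1"));
        command.add(stsModelIndexNo);
        command.add(evaluationModeIndexNo);
        command.add(languageCode);
        command.add(rawTextsPath);
        command.add(scoresPath);
        command.add(word2VecVectorsPath);
        command.add(termFrequenciesPath);
        command.add(msdOrPosPath);
        return command;
    }

    // Run the command, forwarding its output to this process, and return its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Every value below is a hypothetical placeholder; substitute values valid for your setup.
        List<String> command = buildCommand(
                "STSFineGrain.jar", "0", "0", "en",
                "corpus/texts.txt", "corpus/scores.txt",
                "vectors/word2vec.txt", "corpus/tf.txt", "corpus/pos.txt");
        // Print the command rather than running it, so the sketch works without the jar.
        System.out.println(String.join(" ", command));
    }
}
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.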

References

If you wish to use this package in your paper or project, please include a reference to the following paper in which it was presented:

Fine-grained Semantic Textual Similarity for Serbian, Vuk Batanović, Miloš Cvetanović, Boško Nikolić, in Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1370-1378, Miyazaki, Japan (2018).

Be sure to also cite the original paper of each STS model you use:

Additional Documentation

The non-trivial parts of the source code are commented and partly documented in English. If you have questions about how the models work, please consult the source code and the papers listed above. If you cannot find an answer there, feel free to contact me at: vuk.batanovic / at / ic.etf.bg.ac.rs

License

See the license file for licensing information.