= uea-stemmer
Ruby implementation of the UEA-Lite stemmer for conservative stemming in search and indexing workloads.
UEA-Lite[https://web.archive.org/web/20120728132949/http://www.uea.ac.uk/cmp/research/graphicsvisionspeech/speech/WordStemming] uses a rule set to normalize suffixes while avoiding aggressive stemming.
== Behavior Notes
The stemmer operates on a single token at a time and returns a stemmed token.
Notable behavior of this implementation:
- possessive apostrophes are removed
- contractions are expanded by default (for example, don't becomes do not)
- tokens beginning with uppercase letters are preserved, and pluralized acronyms ending in a lowercase s are singularized
- pure numbers, and tokens containing hyphens/underscores, are passed through unchanged
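The pass-through rule in the last item can be sketched as a small predicate. This is an illustration only, not code from the gem: +pass_through?+, +PURE_NUMBER+ and +JOINED+ are hypothetical names, and treating "pure number" as digits with an optional decimal part is an assumption.

```ruby
# Hypothetical sketch, NOT the gem's implementation.
# Assumption: a "pure number" is digits with an optional decimal part.
PURE_NUMBER = /\A\d+(\.\d+)?\z/

# Matches any token that contains a hyphen or an underscore.
JOINED = /[-_]/

# Returns true when a token would be left unchanged under the rule above.
def pass_through?(token)
  token.match?(PURE_NUMBER) || token.match?(JOINED)
end

pass_through?("2024")             # => true
pass_through?("state-of-the-art") # => true
pass_through?("snake_case")       # => true
pass_through?("helpers")          # => false
```

A check like this can be handy in tests that confirm identifiers and numeric tokens survive indexing untouched.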
This gem is a Ruby port of the Java port of the original Perl script written by Marie-Claire Jenkins and Dr. Dan J. Smith at the University of East Anglia.
== Installation
Install the gem:
  gem install uea-stemmer
Install from source:
  git clone https://github.com/ealdent/uea-stemmer.git
  cd uea-stemmer
  bundle install
  bundle exec rake test
  bundle exec rake install
== Example Usage
Basic usage:
  require "uea-stemmer"
  stemmer = UEAStemmer.new
  stemmer.stem("helpers") # => "helper"
  stemmer.stem("dying")   # => "die"
  stemmer.stem("scarred") # => "scar"
You can extract the matching rule with +stem_with_rule+:
  result = stemmer.stem_with_rule("invited")
  result.word     # => "invite"
  result.rule_num # => 22.3
  result.rule     # => #<UEAStemmer::Rule ...>
Disable contraction expansion:
  UEAStemmer.new(nil, nil, skip_contractions: true).stem("don't") # => "don't"
Use the singleton instance:
  DefaultUEAStemmer.instance.stem("running") # => "run"
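Because the stemmer works on one token at a time, stemming a whole string means tokenizing it first. A minimal sketch, assuming plain whitespace tokenization is good enough: +stem_tokens+ is a hypothetical helper, and +ToyStemmer+ is a stand-in so the example runs without the gem. It is not the UEA-Lite rule set.

```ruby
# Hypothetical helper: stems each whitespace-separated token using any
# object that responds to #stem (for example, a UEAStemmer instance).
def stem_tokens(text, stemmer)
  text.split.map { |token| stemmer.stem(token) }
end

# Stand-in stemmer for illustration only. It strips a trailing "s" from
# tokens longer than three characters and does nothing else.
class ToyStemmer
  def stem(token)
    token.end_with?("s") && token.length > 3 ? token.chomp("s") : token
  end
end

stem_tokens("helpers fix bugs", ToyStemmer.new)
# => ["helper", "fix", "bug"]
```

With the real stemmer, pass +UEAStemmer.new+ in place of +ToyStemmer.new+. Since contractions are expanded by default, a single returned string such as "do not" may hold more than one word, so callers that need one term per entry may want to split the results again.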
== Contributing
- Fork the project.
- Make your feature addition or bug fix.
- Add or update tests.
- Run +bundle exec rake test+.
- Send me a pull request. Bonus points for topic branches.
== Relevant Web Pages
- https://web.archive.org/web/20120728132949/http://www.uea.ac.uk/cmp/research/graphicsvisionspeech/speech/WordStemming
- Stemming[https://en.wikipedia.org/wiki/Stemming]
== Copyright
Copyright (c) 2005 by the University of East Anglia. The original stemmer was authored by Marie-Claire Jenkins and Dr. Dan J. Smith. This Ruby port was written by Jason Adams, based on the Java port by Richard Churchill.
This project is distributed under the Apache 2.0 License[https://www.apache.org/licenses/LICENSE-2.0]. See LICENSE for details.