= uea-stemmer

Ruby implementation of the UEA-Lite stemmer for conservative stemming in search and indexing workloads.

UEA-Lite[https://web.archive.org/web/20120728132949/http://www.uea.ac.uk/cmp/research/graphicsvisionspeech/speech/WordStemming] uses a rule set to normalize suffixes while avoiding aggressive stemming.

== Behavior Notes

The stemmer operates on a single token at a time and returns a stemmed token.

Notable behavior of this implementation:

  • possessive apostrophes are removed
  • contractions are expanded by default (for example, don't becomes do not)
  • tokens beginning with uppercase letters are preserved, and pluralized acronyms ending in a lowercase s are singularized
  • pure numbers, and tokens containing hyphens/underscores, are passed through unchanged
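The pass-through and acronym behaviors listed above can be pictured as a set of checks that run before any suffix rule is tried. The following is a minimal sketch under stated assumptions, not the gem's code: the method name +precheck+, the regular expressions, and the order of the checks are all hypothetical, and "pure number" is assumed here to mean digits only.

```ruby
# Hypothetical pre-checks, written for illustration only; the real
# implementation may order or define these cases differently.
# Returns the token to emit directly, or nil when suffix rules should run.
def precheck(token)
  return token if token.match?(/\A\d+\z/)           # pure number (digits only, by assumption)
  return token if token.match?(/[-_]/)              # contains a hyphen or underscore
  return token.chop if token.match?(/\A[A-Z]+s\z/)  # pluralized acronym such as "CDs"
  return token if token.match?(/\A[A-Z]/)           # begins with an uppercase letter
  nil
end

precheck("2024")       # => "2024"
precheck("well-known") # => "well-known"
precheck("CDs")        # => "CD"
precheck("London")     # => "London"
precheck("helpers")    # => nil
```

In this sketch the acronym check has to come before the general uppercase check, otherwise "CDs" would be returned unchanged.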

This is a port to Ruby from the Java port of the original Perl script by Marie-Claire Jenkins and Dr. Dan J. Smith at the University of East Anglia.

== Installation

Install the gem:

  gem install uea-stemmer

Install from source:

  git clone https://github.com/ealdent/uea-stemmer.git
  cd uea-stemmer
  bundle install
  bundle exec rake test
  bundle exec rake install

== Example Usage

Basic usage:

  require "uea-stemmer"
  stemmer = UEAStemmer.new

  stemmer.stem("helpers") # => "helper"
  stemmer.stem("dying")   # => "die"
  stemmer.stem("scarred") # => "scar"

You can extract the matching rule with +stem_with_rule+:

  result = stemmer.stem_with_rule("invited")
  result.word     # => "invite"
  result.rule_num # => 22.3
  result.rule     # => #<UEAStemmer::Rule ...>

Disable contraction expansion:

  UEAStemmer.new(nil, nil, skip_contractions: true).stem("don't")
  # => "don't"

Use the singleton instance:

  DefaultUEAStemmer.instance.stem("running") # => "run"

== Contributing

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add or update tests.
  • Run +bundle exec rake test+.
  • Send me a pull request. Bonus points for topic branches.

== Relevant Web Pages

== Copyright

Copyright (c) 2005 by the University of East Anglia and authored by Marie-Claire Jenkins and Dr. Dan J. Smith. This port to Ruby was done by Jason Adams using the port to Java by Richard Churchill.

This project is distributed under the Apache 2.0 License[https://www.apache.org/licenses/LICENSE-2.0]. See LICENSE for details.
