Saturday, November 29, 2008

Stemming

Stemming plays a major role in indexing applications. In many cases several words have similar semantic interpretations and can be treated as equivalent for the purposes of information retrieval. For that reason, stemmers have been developed to reduce a word to its root form. Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The process of stemming, often called conflation, is useful in search engines for indexing and in other natural language processing problems.

For example, a stemmer for English should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". Similarly, a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish".

The stemming process stems a given word in a few steps.

The first step removes plurals and the suffixes -ed and -ing.

e.g.

fishing -> fish, feed -> feed, agreed -> agree
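The first step can be sketched roughly as follows. This is a minimal Python sketch, assuming the steps in this post follow the Porter stemming algorithm (which the descriptions appear to match). The function names (`is_consonant`, `measure`, `step1` and so on) are made up for illustration and do not come from any Stemmer class mentioned here.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] acts as a consonant (y counts as one at the start or after a vowel)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(word):
    return (len(word) >= 2 and word[-1] == word[-2]
            and is_consonant(word, len(word) - 1))

def ends_cvc(word):
    """Ends consonant-vowel-consonant, where the last consonant is not w, x or y."""
    n = len(word)
    if n < 3:
        return False
    return (is_consonant(word, n - 3) and not is_consonant(word, n - 2)
            and is_consonant(word, n - 1) and word[-1] not in "wxy")

def step1(word):
    # Plurals: -sses -> -ss, -ies -> -i, and a plain trailing -s is dropped.
    if word.endswith("sses") or word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # -eed becomes -ee only when the remaining stem is long enough.
    if word.endswith("eed"):
        if measure(word[:-3]) > 0:
            word = word[:-1]
        return word
    # -ed and -ing are removed only when the remaining stem contains a vowel.
    for suffix in ("ed", "ing"):
        if word.endswith(suffix) and has_vowel(word[:-len(suffix)]):
            word = word[:-len(suffix)]
            if word.endswith(("at", "bl", "iz")):
                word += "e"          # e.g. a stem ending in -at gets its e back
            elif ends_double_consonant(word) and word[-1] not in "lsz":
                word = word[:-1]     # stemm -> stem
            elif measure(word) == 1 and ends_cvc(word):
                word += "e"          # hop -> hope
            break
    return word

for w in ("fishing", "feed", "agreed", "cats", "stemming"):
    print(w, "->", step1(w))
```

The `measure` check is what keeps "feed" unchanged while "agreed" becomes "agree": the part before -eed is too short in the first case.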

The next step turns a terminal y into i when there is another vowel in the stem. The third step maps double suffixes to single ones, so -ization maps to -ize, and so on.

The fourth step deals with -ic-, -full, -ness etc. in the same way as step three. The fifth step removes -ant, -ence etc. from the given word.
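The second and third steps could look something like the sketch below. It is only an illustration in Python, again assuming Porter-style rules. The suffix table is a small hand-picked excerpt rather than the full list, and the names `y_to_i`, `map_double_suffix` and `DOUBLE_SUFFIXES` are hypothetical.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] acts as a consonant (y counts as one at the start or after a vowel)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

# Excerpt only: a real stemmer maps many more double suffixes.
DOUBLE_SUFFIXES = {
    "ization": "ize",
    "ational": "ate",
    "fulness": "ful",
    "ousness": "ous",
    "iveness": "ive",
}

def y_to_i(word):
    """Turn a terminal y into i when the rest of the word contains a vowel."""
    stem = word[:-1]
    if word.endswith("y") and any(not is_consonant(stem, i) for i in range(len(stem))):
        return stem + "i"
    return word

def map_double_suffix(word):
    """Map a double suffix to a single one, if the remaining stem is long enough."""
    for suffix, replacement in DOUBLE_SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word
    return word

print(y_to_i("happy"))                     # happi
print(y_to_i("sky"))                       # sky (no other vowel in the stem)
print(map_double_suffix("normalization"))  # normalize
```

Steps four and five work on the same table-driven pattern with different suffix lists and stricter length conditions, so they are left out here.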

In an indexing application, when a word is given to be indexed, the application first calls the Stemmer class and then indexes the stemmed word.

Stemming saves space and reduces response time, since a single key is used instead of several keys.
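To make the space saving concrete, here is a toy inverted index in Python, built once without stemming and once with it. Everything in it is an assumption for illustration: `crude_stem` is a deliberately simple placeholder that strips a few suffixes (not a real stemmer), and `build_index` and the sample documents are invented.

```python
from collections import defaultdict

def crude_stem(word):
    """Placeholder stemmer: strips a few common suffixes and nothing more."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_index(docs, stem=None):
    """Map each key (raw or stemmed token) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            key = stem(token) if stem else token
            index[key].add(doc_id)
    return index

docs = {
    1: "fishing boats",
    2: "he fished all day",
    3: "fish market",
}

plain = build_index(docs)
stemmed = build_index(docs, crude_stem)

print(len(plain), "keys without stemming")   # 8
print(len(stemmed), "keys with stemming")    # 6
print(sorted(stemmed[crude_stem("fishing")]))  # [1, 2, 3]
```

With stemming, "fishing", "fished" and "fish" share one key, so a query for any of them is answered by a single lookup. The query term has to be stemmed the same way as the indexed words for this to work.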

Lucene for Indexing Applications

Lucene is an open-source information retrieval library. It was originally written in Java, but has been ported to other programming languages including Delphi, Perl, C++, Python, Ruby, PHP and C#.

It can be used in any application that requires full-text indexing and searching capability.

The key classes that we will use to build a search engine are:

  • Document - The Document class represents a document in Lucene. We index Document objects and get Document objects back when we do a search.
  • Field - The Field class represents a section of a Document. The Field object will contain a name for the section and the actual data.
  • Analyzer - The Analyzer class is an abstract class that provides an interface for taking the text of a Document and turning it into tokens that can be indexed. There are several useful implementations of this class, but the most commonly used is the StandardAnalyzer class.
  • IndexWriter - The IndexWriter class is used to create and maintain indexes.
  • IndexSearcher - The IndexSearcher class is used to search through an index.
  • QueryParser - The QueryParser class is used to parse a search string into a Query object that can be run against an index.
  • Query - The Query class is an abstract class that contains the search criteria created by the QueryParser.
  • Hits - The Hits class contains the Document objects that are returned by running the Query object against the index.