Saturday, November 29, 2008

Stemming

Stemming plays a major role in indexing applications. Often several words share essentially the same meaning and can be treated as equivalent for the purposes of information retrieval. For that reason, stemmers have been developed to reduce a word to its root form. Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. The process, often called conflation, is useful in search engines for indexing and in other natural language processing problems.

For example, a stemmer for English should identify the string "cats" (and possibly "catlike", "catty", etc.) as based on the root "cat", and "stemmer", "stemming", and "stemmed" as based on "stem". Likewise, a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish".

The stemming process stems a given word in a few steps.

The first step removes plurals and the suffixes -ed and -ing.

e.g.

fishing -> fish, feed -> feed, agreed -> agree
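The first step can be sketched in Python roughly as follows. This is a simplified illustration, not a full stemmer: the function name `strip_plural_and_past` and the "stem must contain a vowel" check are assumptions made for this example, and real stemmers such as the Porter algorithm use stricter conditions.

```python
VOWELS = set("aeiou")


def has_vowel(stem):
    """Return True if the candidate stem contains at least one vowel."""
    return any(ch in VOWELS for ch in stem)


def strip_plural_and_past(word):
    """Simplified first step: remove plurals, then -ed or -ing."""
    # Plural endings: -sses -> -ss, -ies -> -i, -s -> (nothing), but keep -ss.
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]

    # -eed -> -ee only when the part before the suffix has a vowel,
    # so "agreed" becomes "agree" while "feed" is left alone.
    if word.endswith("eed"):
        if has_vowel(word[:-3]):
            word = word[:-1]
    elif word.endswith("ed") and has_vowel(word[:-2]):
        word = word[:-2]
    elif word.endswith("ing") and has_vowel(word[:-3]):
        word = word[:-3]
    return word


for w in ("fishing", "feed", "agreed", "cats"):
    print(w, "->", strip_plural_and_past(w))
```

The vowel check is what stops short words such as "feed" or "sing" from being cut down to a single consonant.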

The second step turns a terminal y into i when there is another vowel in the stem. The third step maps double suffixes to single ones, so -ization maps to -ize, and so on.

The fourth step deals with -ic-, -ful, -ness, etc. in the same way as step three. The fifth step removes -ant, -ence, etc. from the given word.
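The later steps are all table-driven suffix rewrites, which a short Python sketch can make concrete. The rule tables below hold only a few sample suffixes, and the names `y_to_i`, `replace_suffix`, and `later_steps` are hypothetical. The "stem contains a vowel" condition is a simplification assumed for this example. Real stemmers such as the Porter algorithm apply stricter conditions and much larger tables.

```python
def has_vowel(stem):
    """Return True if the candidate stem contains at least one vowel."""
    return any(ch in "aeiou" for ch in stem)


def y_to_i(word):
    """Turn a terminal y into i when the rest of the word has a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


def replace_suffix(word, rules):
    """Apply the longest matching suffix rule, if the remaining stem has a vowel."""
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + rules[suffix] if has_vowel(stem) else word
    return word


# Sample rules only (assumed for illustration).
DOUBLE_TO_SINGLE = {"ization": "ize", "fulness": "ful", "ousness": "ous"}
SIMPLIFY = {"icate": "ic", "ful": "", "ness": ""}
REMOVE = {"ant": "", "ence": ""}


def later_steps(word):
    """Run the y -> i step, then the three suffix tables in order."""
    word = y_to_i(word)
    for rules in (DOUBLE_TO_SINGLE, SIMPLIFY, REMOVE):
        word = replace_suffix(word, rules)
    return word


for w in ("happy", "normalization", "hopefulness", "dependence"):
    print(w, "->", later_steps(w))
```

Keeping each step as a separate table means a double suffix is peeled off one layer at a time: "hopefulness" first becomes "hopeful" and only then "hope".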

In an indexing application, when a word is submitted for indexing, the application first calls the Stemmer class and then indexes the stemmed word.

Stemming saves space and reduces response time, since a single key is used in place of several keys.
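A minimal Python sketch of such an index shows how several word forms end up under one key. Both `SimpleStemmer` and `build_index` are hypothetical names invented for this example, and the stemmer only strips a handful of endings. It stands in for whatever Stemmer class the application actually uses.

```python
from collections import defaultdict


class SimpleStemmer:
    """Hypothetical stand-in for a real stemmer; strips a few common endings."""

    SUFFIXES = ("ing", "ed", "er", "s")

    def stem(self, word):
        word = word.lower()
        for suffix in self.SUFFIXES:
            stem = word[:-len(suffix)]
            # Only strip when at least three letters remain.
            if word.endswith(suffix) and len(stem) >= 3:
                return stem
        return word


def build_index(documents, stemmer):
    """Map each stemmed word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[stemmer.stem(word)].add(doc_id)
    return index


documents = {
    1: "fishing boats",
    2: "he fished",
    3: "a fisher and a fish",
}
index = build_index(documents, SimpleStemmer())

# "fishing", "fished", "fisher" and "fish" all share the single key "fish".
print(sorted(index["fish"]))
```

A query would be stemmed with the same stemmer before lookup, so searching for "fished" finds all three documents through the one "fish" entry.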
