Tuesday, April 21, 2009

Did You Mean Feature for a Search Application

This article describes how the suggesting word feature is used for a Searching application which uses Lucene.Net for indexing and searching.
Suggesting word feature is mainly useful when a user input some misspelled words and to suggest the correct word to the user.

To implement the Did You Mean feature by suggesting words, n-gram method is used. n-gram method divides the given misspelled word into sub words by considering the length of the word.
The idea behind with the n-gram method for suggesting word feature is that the misspelling occurs due to mainly one or two letters. Therefore it will only affects few n-grams. Therefore we can recognize the correct word by taking the word which share high proportion of n-grams with the misspelled word.

n-grams are created considering the length of the word.
If the word length greater than five, two grams are created having the length of three and four.
If the word length is five, then the length of the grams are two and three.
If the word length is less than five, then the length of the grams are one and two.

If we consider the misspelled word "university", then the n-gram query is like
start 3: uni^2.0 end:ity gram 3:uni gram 3:niv gram 3:ive gram 3: ves gram 3:esi gram 3:sit gram 3:ity start 4:univ^2.0 end 4:sity gram 4:univ gram 4:nive gram 4:ives gram 4:vesi gram 4:esit gram 4:sity

We index the start and end n-grams seperately because they are positional unlike other n-grams.
For example the words "eat" and "ate" have the same set of n-grams.

Also we use two set of n-grams to increase the accuracy of the suggestions. For example lets consider that words "ball" and "belt" are in the index file and no "bell" word.

Then since the word length is less than five , 1-gram and 2-gram are created.

For the word "ball" grams are
1- gram = b a l l
2-gram = ba al ll

For the word "belt"
1-gram = b e l t
2-gram = be el lt

For the searching word "bell"
1-gram = b e l l
2-gram = be el ll

By considering those 1-gram and 2-grams, we can see that number of grams matching in 1-gram is same. But it is different in the 2-gram. Therefore by taking two grams we can select the closest word.

C# code for the implementation is shown below. For the implementation two classes and one interface is used.

DidYouMeanParser interface is shown below.

using System;
using System.Collections.Generic;
using System.Text;
using Lucene.Net.Search;

///
/// DidYouMeanParser interface
///

namespace Indexer
{
public interface DidYouMeanParser
{
Query parse(string queryString);
Query suggest(string queryString);
}
}


The implementation of
DidYouMeanParser interface is done by the DidYouMeanParserClass as shown below.

using System;
using System.Collections.Generic;
using System.Text;
using
Lucene.Net.Index;
using
Lucene.Net.QueryParsers;
using
Lucene.Net.Search;
using
Lucene.Net.Search.Spell;
using
Lucene.Net.Store;


///
/// Class implement the DidYouMeanParser interface
///

namespace Indexer.DidYouMean
{
public class DidYouMeanParserClass : DidYouMeanParser
{

private string defaultField;
private string spellIndexDirectory;

public DidYouMeanParserClass(string defaultField, string spellIndexDirectory)
{
this.defaultField = defaultField;
this.spellIndexDirectory = spellIndexDirectory;
}

public Query parse(string queryString)
{
return new TermQuery(new Term(defaultField, queryString));
}

public Query suggest(string queryString)
{
try
{
SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
if (spellChecker.Exist(queryString))
{
return null;
}
string[] similarWords = spellChecker.SuggestSimilar(queryString, 1);
if (similarWords.Length == 0)
{
return null;
}
return new TermQuery(new Term(defaultField, similarWords[0]));
}
catch (Exception e)
{
throw new ParseException(e.Message);
}
}
}

}


To compare the misspelled word we use the words in the in index file. For that we create a temporary index file which having the n-grams as fields using the existing index file.
To create that DidYouMeanIndexer class is used as shown below.

using System;
using System.Collections.Generic;
using System.Text;
using Lucene.Net.Index;
using
Lucene.Net.Search;
using
Lucene.Net.Search.Spell;

///
/// Class use to create the temperorary index file for DidYouMeanParser
///

namespace Indexer
{
public class DidYouMeanIndexer
{

public void createSpellIndex(string field,
string originalIndexDirectory,
string spellIndexDirectory)
{

IndexReader indexReader = null;
try
{
indexReader = IndexReader.Open(originalIndexDirectory);
Dictionary dictionary = new LuceneDictionary(indexReader, field);
SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
spellChecker.IndexDictionary(dictionary);
}
finally
{
if (indexReader != null)
{
indexReader.Close();
}
}
}

}
}


After that we can provide the Did You mean feature by passing the misspelled words to the DidYouMeanParser class.


No comments:

Post a Comment