SPELLING CORRECTION IN INFORMATION RETRIEVAL

Spelling correction 


[Image: funny spelling mistakes]

Incorrect spellings can cause a lot of problems. They appear not only in queries but in some documents as well. This blog discusses ways to correct spelling mistakes made by users as well as mistakes in textual data that was scanned and generated automatically by a computer.

There are basically two methods of spelling correction:
  • Context Sensitive
  • Isolated spelling correction
Isolated spelling correction:
In this method, the goal is not to form a grammatically correct sentence but only to correct the individual "wrong" word.
There are various ways in which this can be achieved.

The basic method uses edit distance. The wrong word is compared against a list of correct words, and the correct word with the smallest edit distance to it is predicted as the correction. The edits can also be weighted: for typed queries, substitutions between letters that are near each other on the keyboard can be made cheaper, and for scanned text (OCR), substitutions between similar-looking characters can be made cheaper.
A simple version of this method has been implemented in Python; the link is given below.

https://norvig.com/spell-correct.html
Edit Distance Algorithm 
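As a rough illustration of the edit-distance approach, here is a minimal sketch in Python. It is not the code from the linked article. The ADJACENT_KEYS table, the 0.5 cost for neighbouring keys, and the function names are all assumptions made up for this example.

```python
from typing import Callable, Iterable

# Hypothetical, deliberately tiny keyboard-adjacency table (illustration only).
ADJACENT_KEYS = {
    "a": "qwsz",
    "s": "awedxz",
    "e": "wsdr",
    "i": "ujko",
    "o": "iklp",
}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing a with b; neighbouring keys are assumed cheaper."""
    if a == b:
        return 0.0
    if b in ADJACENT_KEYS.get(a, "") or a in ADJACENT_KEYS.get(b, ""):
        return 0.5  # assumed weight, not taken from the source
    return 1.0


def edit_distance(
    source: str,
    target: str,
    sub_cost: Callable[[str, str], float] = substitution_cost,
) -> float:
    """Weighted Levenshtein distance computed row by row."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [float(i)]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,               # delete s
                current[j - 1] + 1.0,            # insert t
                previous[j - 1] + sub_cost(s, t),  # substitute s with t
            ))
        previous = current
    return previous[-1]


def correct(word: str, vocabulary: Iterable[str]) -> str:
    """Return the vocabulary word closest to the given word."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))
```

For example, correct("speling", ["spending", "selling", "spelling"]) picks "spelling", because it is one insertion away while the other two words need two edits each. Comparing against every word in the list is slow for a large dictionary, which is one reason the methods that follow are useful.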


Another method is to use n-grams. Each term in the dictionary is split into its character n-grams (sequences of n consecutive characters), and a separate dictionary/postings list is maintained that maps each n-gram to the terms containing it. Whenever an incorrect word is detected, its n-grams are looked up, and the term that occurs most often across the postings lists of those n-grams is given as the result.
A simple version of this method has been implemented in Python; the link is given below.

https://gist.github.com/bgreenlee/1321254
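The n-gram lookup can be sketched in a few lines of Python. This is an independent illustration, not the code from the gist linked above. The use of bigrams, the "$" boundary marker, and the function names are assumptions chosen for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Character n-grams of a word, padded with '$' to mark its boundaries."""
    padded = f"${word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(vocabulary: Iterable[str], n: int = 2) -> Dict[str, Set[str]]:
    """Map each n-gram to the set of vocabulary terms that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for term in vocabulary:
        for gram in char_ngrams(term, n):
            index[gram].add(term)
    return index


def suggest(word: str, index: Dict[str, Set[str]], n: int = 2) -> Optional[str]:
    """Return the term that shares the most n-grams with the word, if any."""
    counts: Counter = Counter()
    for gram in char_ngrams(word, n):
        for term in index.get(gram, ()):
            counts[term] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


index = build_index(["border", "boarder", "lord", "morbid"])
print(suggest("bordr", index))  # shares 5 bigrams with "border"
```

Raw overlap counts tend to favour longer terms, so a practical system might normalise the count, for instance with the Jaccard coefficient, or re-rank the top candidates by edit distance.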


Another method that can be used to correct the spelling of a word is clustering. This method is beneficial and faster when the words belong to the same domain. The basic idea is to cluster similar words and then assign a leader to each cluster. To predict the correct word, compute the similarity of the wrong word to all the leaders, and then recursively compute its similarity to the words in the cluster of the closest leader.

The open-source NLTK library provides building blocks for this, such as clustering and edit-distance utilities.
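The leader-based search described above can be sketched with the Python standard library alone. This sketch does not use NLTK, the leaders are picked by hand rather than learned, and it uses a single level of clusters instead of a recursive hierarchy. The similarity measure (difflib's ratio) and all names are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, List


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; stands in for any word-similarity measure."""
    return SequenceMatcher(None, a, b).ratio()


def build_clusters(vocabulary: Iterable[str], leaders: List[str]) -> Dict[str, List[str]]:
    """Attach every vocabulary word to its most similar leader."""
    clusters: Dict[str, List[str]] = {leader: [leader] for leader in leaders}
    for term in vocabulary:
        if term in clusters:
            continue  # leaders already belong to their own cluster
        nearest = max(leaders, key=lambda leader: similarity(term, leader))
        clusters[nearest].append(term)
    return clusters


def correct(word: str, clusters: Dict[str, List[str]]) -> str:
    """Compare against the leaders first, then only within the closest cluster."""
    nearest = max(clusters, key=lambda leader: similarity(word, leader))
    return max(clusters[nearest], key=lambda term: similarity(word, term))


vocabulary = ["network", "networking", "router", "protein", "proteins", "enzyme"]
clusters = build_clusters(vocabulary, leaders=["network", "protein"])
print(correct("netwrok", clusters))
```

The speed-up comes from comparing the wrong word only with the leaders and the members of one cluster, instead of with every word in the vocabulary.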

There are some other machine learning algorithms for isolated spelling correction.

Context-sensitive spelling correction:
This type of spelling correction is used when the result needs to be a grammatically correct sentence. It is generally applied to query terms.

The most basic method is to look at all the terms in the sentence, find the possible corrections for each word, and return the resulting sentence that has been searched for most often according to the query database.
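A minimal Python sketch of this basic method follows. The vocabulary, the query log and its counts are made-up sample data, and the candidate generation via difflib.get_close_matches is only one assumed way of finding possible corrections for each word.

```python
from collections import Counter
from difflib import get_close_matches
from itertools import product
from typing import List

# Made-up sample data, for illustration only.
VOCABULARY = ["flew", "fled", "from", "form", "fore", "heathrow"]
QUERY_LOG = Counter({
    "flew from heathrow": 120,
    "fled from heathrow": 5,
    "flew form heathrow": 2,
})


def candidates(term: str, vocabulary: List[str]) -> List[str]:
    """The term itself plus a few similar vocabulary words."""
    close = get_close_matches(term, vocabulary, n=3, cutoff=0.6)
    return [term] + [word for word in close if word != term]


def correct_query(query: str, vocabulary: List[str], query_log: Counter) -> str:
    """Try every combination of candidates and keep the most searched one."""
    terms = query.lower().split()
    options = [candidates(term, vocabulary) for term in terms]
    best, best_count = query, 0
    for combination in product(*options):
        phrase = " ".join(combination)
        count = query_log[phrase]  # a Counter returns 0 for unseen queries
        if count > best_count:
            best, best_count = phrase, count
    return best


print(correct_query("flew form heathrow", VOCABULARY, QUERY_LOG))
```

Note that "form" is itself a correctly spelled word, so isolated correction would leave it alone. Only the query log reveals that "flew from heathrow" is the likelier query. The number of combinations grows quickly with query length, so a real system would limit the candidates kept per term.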

There are various other context-sensitive spelling correction methods. 
Some open-source software for context-sensitive spelling correction:

Winnow https://www.codeproject.com/.../Context-Sensitive-Spelling-Correction-using-Winnow
Gecco https://pypi.python.org/pypi/Gecco/0.2.3

Spelling correction is a great way to make documents look more professional, unlike the one in the first image, and to improve queries so that the correct documents are retrieved.



References:
Youssef Bassil, Mohammad Alwani. OCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion. LACSC – Lebanese Association for Computational Sciences, Registered under No. 957, 2011, Beirut, Lebanon. youssef.bassil@lacsc.org, mohammad.alwani@lacsc.org

https://norvig.com/spell-correct.html

https://nlp.stanford.edu/IR-book/html/htmledition/contents-1.html

https://pypi.python.org/pypi/Gecco/0.2.3

https://www.codeproject.com/.../Context-Sensitive-Spelling-Correction-using-Winnow




