Do you ever wonder how Google knows when your spelling goes wrong, or suggests an alternative when a word doesn't fit the sentence? It's all done by a spell checker, a program that detects incorrect or misspelled words and offers alternatives to the user. A spell checker can be part of a larger application such as Microsoft Word, a browser extension, or a standalone system.
The basic functions performed by a spell checker include:
1. Scanning the input text.
2. Detecting the incorrectly spelled words. This is done by comparing every word with a dictionary of known words to find non-dictionary words.
3. Considering different forms of the word, such as verb forms and plurals.
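The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the tiny `KNOWN_WORDS` set, the suffix list, and the function names are hypothetical, and a real checker would use a full word list and a proper stemmer or lemmatizer.

```python
import re

# Hypothetical toy dictionary; a real checker would load a full word list.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "walk"}

def base_forms(word):
    """Yield the word plus crude guesses at its base form (plurals, verb endings)."""
    yield word
    for suffix in ("s", "es", "ed", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            yield word[: -len(suffix)]

def find_misspelled(text, dictionary=KNOWN_WORDS):
    """Return the words in `text` that match no dictionary entry in any form."""
    # Step 1: scan the input text and split it into lowercase words.
    words = re.findall(r"[a-z']+", text.lower())
    # Steps 2 and 3: keep a word only if none of its forms is in the dictionary.
    return [w for w in words if not any(f in dictionary for f in base_forms(w))]
```

With this toy dictionary, `find_misspelled("The catt sat")` flags only `catt`, while `cats` and `walked` pass because their base forms are known.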
The basic task of automatic spelling correction is therefore to detect misspelled words and suggest proper alternatives, drawing on the set of words with minimum lexical distance to the incorrect one and substituting the most appropriate candidate. A major shortcoming of such systems is that the required solution may be lexically distant from the misspelled word and still be the most appropriate alternative in the context of the sentence. Hence, the spell corrector must know something about the general context of the paragraph.
A combination of lexical correction using edit distance and context-sensitive correction using a corpus built from documents from various sources on the internet seems promising for solving the problem.
1. Levenshtein’s Distance
Levenshtein distance, also known as edit distance, is the number of modifications required to change a string s1 into another string s2. These modifications typically include inserting a new character, deleting an existing character, or replacing character ‘a’ with ‘b’.
Thus the words with minimum edit distance are, lexically, the best-suited alternatives for the misspelled word.
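As an illustration, the edit distance can be computed with the standard dynamic-programming recurrence, keeping only two rows of the table. The helper `closest_words` and its `limit` parameter are hypothetical names added here to show how candidates might be ranked by distance.

```python
def levenshtein(s1, s2):
    """Number of insertions, deletions and substitutions turning s1 into s2."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def closest_words(word, dictionary, limit=3):
    """Return up to `limit` dictionary words, nearest first (ties alphabetical)."""
    return sorted(dictionary, key=lambda w: (levenshtein(word, w), w))[:limit]
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.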
2. Learning the context’s model
The goal of a language model is to calculate the relative probabilities of words. These are estimated from the learning corpus and later compared against the words observed in a text.
For each word, its relative likelihood is estimated in each of the different contexts, without regard to the neighboring words.
3. Detecting the context
Here, we focus on detecting the context of the misspelled word. Several subjects can be discussed in the same document, so the text is segmented into paragraphs at each carriage return (\r). Punctuation is then removed, along with terms that carry no significant information, such as articles and pronouns, keeping only the informative words. Finally, the context of each paragraph is identified.
To identify the context, the following formula could be used:
where,
Parag = {w1, w2, …, wn} : a paragraph of words wi
Freq(wi) : number of occurrences of wi in Parag
P(wi/context j) : relative probability of wi in the context j
C = {c1, c2, c3, …, cm} : the complete set of contexts
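One plausible reading of these definitions, and only an assumption here, is that the detected context is the c_j in C that maximizes the frequency-weighted log-probability of the paragraph's words, i.e. the sum over i of Freq(wi) · log P(wi/context j). The sketch below implements that reading. The `floor` probability for unseen words and the layout of `context_probs` are illustrative choices, not taken from the cited paper.

```python
import math
from collections import Counter

def detect_context(paragraph_words, context_probs, floor=1e-6):
    """Pick the context c_j maximizing sum_i Freq(w_i) * log P(w_i | c_j).

    paragraph_words: informative words of one paragraph (Parag)
    context_probs:   {context: {word: P(word | context)}} for every context in C
    floor:           assumed probability for words unseen in a context
    """
    freq = Counter(paragraph_words)  # Freq(w_i)

    def score(context):
        probs = context_probs[context]
        return sum(n * math.log(probs.get(w, floor)) for w, n in freq.items())

    return max(context_probs, key=score)
```

Working in log space avoids numerical underflow when a paragraph contains many words.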
4. Correction
To correct errors according to the context, we begin with a full pass over the text to identify the misspelled words and correct those that carry information. We then detect the context of the text and use it as the basis for correcting the errors and ranking the proposed solutions.
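A minimal sketch of the ranking step might look as follows. It assumes the candidate corrections and their edit distances have already been computed. The function name, the `max_distance` cutoff, and the rule "context probability first, edit distance as tie-breaker" are illustrative assumptions rather than the cited paper's exact procedure.

```python
def rank_candidates(candidates, context_probs, max_distance=2):
    """Order candidate corrections by context probability, then by edit distance.

    candidates:    {word: edit distance to the misspelled word}
    context_probs: {word: P(word | detected context)}
    """
    # Keep only candidates that are lexically close enough to be plausible.
    near = {w: d for w, d in candidates.items() if d <= max_distance}
    # Most probable in context first; ties go to the smaller edit distance.
    return sorted(near, key=lambda w: (-context_probs.get(w, 0.0), near[w], w))
```

In a sports context where "goal" is far more probable than "coal", this puts "goal" first even though "coal" is lexically closer to the misspelling.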
After identifying the keywords making up the corpus, the occurrences of each word in the context can be calculated to determine the relative probability of each word based on the following formula:
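A natural estimate, assumed here, is the count of a word in a context divided by the total number of keyword occurrences in that context. The sketch below computes it; the function name and the input layout (a mapping from context name to a list of keywords) are hypothetical.

```python
from collections import Counter

def context_probabilities(corpus_by_context):
    """Map each context to {word: count(word in context) / total words in context}."""
    model = {}
    for context, words in corpus_by_context.items():
        counts = Counter(words)
        total = sum(counts.values())
        # An empty context yields an empty table, so no division by zero occurs.
        model[context] = {w: n / total for w, n in counts.items()}
    return model
```

For instance, with the keywords `["goal", "goal", "match", "team"]` in a "sports" context, "goal" gets probability 0.5 and "match" gets 0.25.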
Finally, a rich and well-trained learning corpus is ready.
References and image source:
Nejja, Mohammed, and Abdellah Yousfi. "The context in automatic spell correction." Procedia Computer Science 73 (2015): 109-114.