Information retrieval with OCR errors

A large number of documents on the web are not textual but images of textual content, for example JPEG photographs of pages or scanned documents.
Optical character recognition (OCR) is used to extract the textual content from these images so that they can be treated as normal textual data, but OCR does not handle such documents perfectly because of errors in character recognition.

However, it has been observed that OCR makes certain mistakes very frequently across document images. This regularity allows the errors to be corrected without relying on a lexicon. First, a collection of error-grams and correction rules is generated; second, the query terms, error-grams and correction rules are used together for information retrieval.


The difficulty with document images stems from noise, poor contrast and compression artefacts.
Many features, such as intensity and texture, need to be extracted to distinguish text from the rest of the document image.

OCR errors correspond to edit operations: character substitution, deletion and insertion. These make it difficult to obtain an OCR result with high accuracy, and the error rate increases as image quality decreases. The error types are classified separately.
A trivial solution is to use a lexicon to correct words, but this leads to many false corrections, since no lexicon can cover everything.

Edit distance can be used to reduce these losses: it locates OCR errors, from which error-grams and correction rules are collected, focusing on a set of common n-gram errors. Error-grams are found by applying edit distance at levels denoted by the number of corrections applied.
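The edit distance used here is the standard Levenshtein distance over the three OCR error operations (substitution, insertion, deletion). A minimal dynamic-programming sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    substitutions, insertions and deletions needed to turn a into b."""
    dp = list(range(len(b) + 1))   # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i     # prev holds the diagonal cell
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev if ca == cb else 1 + min(prev, dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]
```

For instance, `edit_distance("rn", "m")` is 2 (one substitution plus one deletion), which is why the classic m/rn confusion shows up as a common OCR error.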
A set of text documents and their corresponding document images were collected, and OCR was applied to extract text from the images. Using string-matching techniques, the OCR text was mapped to the correct text and error-grams were collected.
Let X be the set of words in the original document and X' be the set of words in the OCR output.
Correctly recognized words: Mrec = X ∩ X'
Remaining original words: Mremo = X − Mrec
Remaining OCR words: Mremr = X' − Mrec
For each word in Mremo, scan the words in Mremr and select the one with minimum edit distance.
Then locate the errors, verify the quality and accuracy of the matching, and construct error-grams from the characters adjacent to each confused character.
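The matching procedure above can be sketched as follows. This is a simplified word-level version for illustration: `edit_distance` is the standard Levenshtein distance, and a real system would also exploit word positions when aligning the two texts.

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance (substitution/insertion/deletion).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev if ca == cb else 1 + min(prev, dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def match_words(original_text: str, ocr_text: str):
    """Pair each unmatched original word with the closest OCR word."""
    X, Xp = set(original_text.split()), set(ocr_text.split())
    m_rec = X & Xp            # correctly recognized words
    m_rem_o = X - m_rec       # remaining original words
    m_rem_r = Xp - m_rec      # remaining OCR words
    pairs = []
    for w in sorted(m_rem_o):
        if m_rem_r:
            # select the OCR word at minimum edit distance from w
            pairs.append((w, min(m_rem_r, key=lambda v: edit_distance(w, v))))
    return m_rec, pairs
```

For example, `match_words("the quick brown fox", "the qu1ck brown fox")` recognizes `the`, `brown` and `fox` exactly and pairs `quick` with `qu1ck`, exposing the i→1 confusion for error-gram collection.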
The correction rules give the probability that a character A in the document image is rendered as B by the OCR engine, estimated with a Bayesian rule:
P(B|A) = #(A→B) / #A, where #(A→B) is the number of occurrences of the interchange (substitution, decomposition, etc.) and #A is the number of occurrences of A. To improve the cost estimates of the error n-grams, an iterative technique is used to generate new matchings; this increases recognition of the original text.
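A sketch of estimating these correction-rule probabilities from aligned word pairs. For simplicity it only considers equal-length pairs, i.e. pure substitutions; handling decompositions such as m→rn would require character-level alignment.

```python
from collections import Counter

def correction_rule_probs(pairs):
    """Estimate P(B|A) = #(A->B) / #A from aligned (original, OCR)
    word pairs, counting single-character substitutions only."""
    subs = Counter()         # (A, B) substitution counts
    char_counts = Counter()  # occurrences of each original character
    for orig, ocr in pairs:
        if len(orig) != len(ocr):
            continue  # skip insertion/deletion pairs in this sketch
        for a, b in zip(orig, ocr):
            char_counts[a] += 1
            if a != b:
                subs[(a, b)] += 1
    return {(a, b): n / char_counts[a] for (a, b), n in subs.items()}
```

Given the pairs `("quick", "qu1ck")`, `("life", "l1fe")` and `("in", "in")`, the character `i` occurs three times and is misread as `1` twice, so P(1|i) = 2/3.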

Retrieving documents
String matching is an essential tool in IR applications, but classical exact string matching cannot be applied effectively to document images because the OCR errors make exact matches fail.
First, the query search terms are expanded and weights are assigned to the resulting list; then candidate terms are selected, the collection of document images is searched, and retrieval performance is measured.

Query expansion
A word is treated not as a single unit but as a set of overlapping n-grams. A set of expanded search terms is generated by substituting error-grams into the term.
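Query expansion by error-gram substitution might look like the following sketch. The error-gram table here is a made-up example (the classic m ↔ rn confusion); a real system would use the table learned in the previous step, with weights attached to each variant.

```python
def expand_query(term: str, error_grams: dict) -> set:
    """Expand a query term by substituting each known error-gram
    (correct n-gram -> commonly OCR'd variant) at every position
    where the correct n-gram occurs."""
    variants = {term}  # always keep the original term
    for correct, erroneous in error_grams.items():
        start = term.find(correct)
        while start != -1:
            variants.add(term[:start] + erroneous + term[start + len(correct):])
            start = term.find(correct, start + 1)
    return variants
```

For example, with the hypothetical table `{"m": "rn", "rn": "m"}`, the query `modern` expands to `modern`, `rnodern` and `modem`, so documents where OCR garbled either n-gram can still be retrieved.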

The above technique offers a significant improvement in OCR document retrieval over exact matching, partial matching and 3-gram overlap, achieving 99.08% recall and 91.22% precision.
The method collects frequent error-grams and correction rules, uses them to expand query terms, and thereby improves retrieval performance.


References:
http://rali.iro.umontreal.ca/rali/sites/default/files/publis/DIAR_05.pdf
https://www.cs.cmu.edu/~callan/Papers/forum02.pdf
