Hi,
I've been working on a project in which I had to retrieve text from images of research papers. The task initially looked pretty simple. There are already a couple of software tools available which I presumed would simplify the job. But, to my horror, the images I was supposed to OCR were of poor quality, and even very good tools like Tesseract were not able to detect the words in them properly. This forced me to first pre-process each image in a collection of nearly 4k images by enhancing it, and only then apply OCR. The job finally got done, but it compelled me to wonder whether OCRing an image is the only way to make it suitable for text retrieval. That thought is the motivation for this blog post!
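As an aside, here is a minimal sketch of the kind of enhance-then-OCR pipeline I ended up using, assuming OpenCV and pytesseract are installed; the denoising strength and the choice of Otsu's threshold are illustrative picks, not the exact settings from my project.

```python
import cv2
import pytesseract

def enhance_and_ocr(path):
    """Enhance a poorly scanned page, then OCR it with Tesseract."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove scanning noise (strength 30 is an illustrative choice).
    denoised = cv2.fastNlMeansDenoising(gray, None, 30, 7, 21)

    # Otsu's threshold binarizes the page: dark text on a white background.
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return pytesseract.image_to_string(binary)

print(enhance_and_ocr("scanned_page.png"))
```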
[Figure: a) original image (un-enhanced); b) OCRed text using Tesseract]
Factors that cause problems when retrieving text from images using OCR:
- Differences in text styles
- Differences in text fonts
- Background images and colours
- Layout of the text: single column or double column
Solutions?
- Convert images to text before performing text retrieval on imaged data, but don't use OCR :P
- Choose software that keeps the above-mentioned factors in view and then performs OCR, like Tesseract. But Tesseract doesn't work well on low-quality images like poorly scanned documents.
Research has been conducted in this field, and researchers have come up with many efficient ways of retrieving text from imaged data without applying OCR. One such study that I found useful is [1], wherein the researchers formed character objects from the text present in the images and then created HTD (Horizontal Traverse Density) and VTD (Vertical Traverse Density) vectors of those objects, which were used as the objects' features. After feature extraction in this manner, an n-gram algorithm was used to compute similarity and retrieve the results. The steps they followed are detailed below:
- Preprocessing: Deskewing, noise removal, removal of headings and pictures
- Feature Extraction: Connected components (CCs) were analyzed to form character objects. Three types of character objects (COs) were observed:
- CO with one CC: like the character s.
- CO with more than one CC: like the characters i and j.
- CO with characters connected to each other: like the characters tt and ft.
- Vector Formation: HTD and VTD vectors were formed for each CC (sketched in code right after this list)
- Classification of COs: An unsupervised classifier was built to assign COs to classes based on a distance computed from their HTD and VTD vectors.
- Similarity Computation: An n-gram algorithm was applied to the document vectors so created (sketched further below).
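To make the vector-formation and classification steps concrete, here is a rough sketch in Python. It assumes one plausible reading of HTD and VTD: for each row (HTD) or each column (VTD) of a component's bitmap, count how many distinct black runs the scan line traverses. For simplicity it treats every connected component as its own character object (skipping the CO-merging step for characters like i and j), and the vector length and distance threshold are illustrative values of mine, not numbers from the study.

```python
import numpy as np
from scipy import ndimage

VEC_LEN = 16          # resample vectors to a fixed length (illustrative)
NEW_CLASS_DIST = 4.0  # distance threshold for opening a new class (illustrative)

def traverse_density(bitmap):
    """HTD/VTD: number of black runs each horizontal/vertical scan line crosses."""
    def runs_per_line(lines):
        # A run starts wherever a pixel is black and its predecessor was white.
        padded = np.pad(lines, ((0, 0), (1, 0)))
        return ((padded[:, 1:] == 1) & (padded[:, :-1] == 0)).sum(axis=1)
    htd = runs_per_line(bitmap)    # one count per row
    vtd = runs_per_line(bitmap.T)  # one count per column
    return htd, vtd

def resample(v, n=VEC_LEN):
    """Stretch/shrink a vector to length n so differently sized COs compare."""
    idx = np.linspace(0, len(v) - 1, n)
    return np.interp(idx, np.arange(len(v)), v)

def features(binary_page):
    """Label connected components and return one HTD+VTD feature per component."""
    labels, count = ndimage.label(binary_page)
    feats = []
    for i, sl in enumerate(ndimage.find_objects(labels)):
        bitmap = (labels[sl] == i + 1).astype(np.uint8)
        htd, vtd = traverse_density(bitmap)
        feats.append(np.concatenate([resample(htd), resample(vtd)]))
    return feats

def classify(feats):
    """Unsupervised: assign each CO to the nearest class, or open a new one."""
    classes, codes = [], []
    for f in feats:
        dists = [np.linalg.norm(f - c) for c in classes]
        if dists and min(dists) < NEW_CLASS_DIST:
            codes.append(int(np.argmin(dists)))
        else:
            classes.append(f)  # this CO seeds a new class
            codes.append(len(classes) - 1)
    return codes               # the document as a sequence of class codes
```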
[Figure: HTD and VTD vectors of lower case v. [1]]
[Figure: a) Original image of the document. b) COs enclosed in bounding boxes. c) Classifying COs. d) Aggregating COs of a document. [1]]
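Once every CO has a class code, a document reduces to a sequence of codes, and the n-gram step becomes ordinary sequence matching. Below is a small sketch using trigrams and Dice's coefficient; both the choice of n = 3 and of Dice as the similarity measure are my assumptions, since the paper's exact n-gram variant may differ.

```python
def ngrams(codes, n=3):
    """All consecutive n-grams of a class-code sequence."""
    return {tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)}

def dice_similarity(query_codes, doc_codes, n=3):
    """Dice's coefficient between the n-gram sets of query and document."""
    q, d = ngrams(query_codes, n), ngrams(doc_codes, n)
    if not q or not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))

def retrieve(query_codes, documents):
    """Rank documents (a dict of name -> code sequence) against the query."""
    scored = [(dice_similarity(query_codes, doc), name)
              for name, doc in documents.items()]
    return sorted(scored, reverse=True)
```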
On querying, the average recall was 47.4% while the precision was 85%. Although the results aren't exactly overwhelming, it can be said that such an approach is a pretty good start to text retrieval from imaged data without applying OCR.