Hi,
I've been working on a project in which I had to retrieve text from images of research papers. The task initially looked pretty simple. There are already a couple of software tools available which I presumed would simplify the job. But, to my horror, the images I was supposed to OCR were of poor quality, and even very good tools like Tesseract were not able to detect the words in them properly. This forced me to first pre-process each image in a collection of nearly 4k images by enhancing it, and only then apply OCR. The job finally got done, but it compelled me to wonder whether OCRing an image is the only way to make it suitable for text retrieval. That thought is the motivation for this blog post!
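As an aside, here is a minimal sketch of the kind of enhance-then-OCR pipeline I ended up using, assuming OpenCV and pytesseract are installed; the denoising strength and the choice of Otsu's threshold are illustrative picks, not the exact settings from my project.

```python
import cv2
import pytesseract

def enhance_and_ocr(path):
    """Enhance a poorly scanned page, then OCR it with Tesseract."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove scanning noise (strength 30 is an illustrative choice).
    denoised = cv2.fastNlMeansDenoising(gray, None, 30, 7, 21)

    # Otsu's threshold binarizes the page: dark text on a white background.
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return pytesseract.image_to_string(binary)

print(enhance_and_ocr("scanned_page.png"))
```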
[Figure: a) original image (un-enhanced); b) OCRed text using Tesseract]
Factors that cause problems when retrieving text from images using OCR:
- Differences in text styles
- Differences in text fonts
- Background images and colours
- Layout of the text: single column or double column
Solutions?
- Convert images to text before performing text retrieval on imaged data, but don't use OCR :P
- Choose software that keeps the above-mentioned factors in view and then performs OCR, like Tesseract. But Tesseract doesn't work well on low-quality images like poorly scanned documents.
Research has been conducted in this field, and researchers have come up with many efficient ways of retrieving text from imaged data without applying OCR. One such study that I found useful is [1], wherein the researchers formed character objects from the text present in the images and then created HTD (Horizontal Traverse Density) and VTD (Vertical Traverse Density) vectors of those objects, which were used as the objects' features. After feature extraction in this manner, an n-gram algorithm was used to compute similarity and retrieve the results. The steps they followed are detailed below:
- Preprocessing: Deskewing, noise removal, removal of headings and pictures
- Feature Extraction: Connected components (CCs) were analyzed to form character objects. Three types of character objects (COs) were observed:
- CO with one CC: like the character s.
- CO with more than one CC: like the characters i and j.
- CO with characters connected to each other: like the characters tt and ft.
- Vector Formation: HTD and VTD vectors were formed for each CC (sketched in code right after this list)
- Classification of COs: An unsupervised classifier was built to assign COs to classes based on a distance computed from their HTD and VTD vectors.
- Similarity Computation: An n-gram algorithm was applied to the document vectors so created (sketched further below).
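To make the vector-formation and classification steps concrete, here is a rough sketch in Python. It assumes one plausible reading of HTD and VTD: for each row (HTD) or each column (VTD) of a component's bitmap, count how many distinct black runs the scan line traverses. For simplicity it treats every connected component as its own character object (skipping the CO-merging step for characters like i and j), and the vector length and distance threshold are illustrative values of mine, not numbers from the study.

```python
import numpy as np
from scipy import ndimage

VEC_LEN = 16          # resample vectors to a fixed length (illustrative)
NEW_CLASS_DIST = 4.0  # distance threshold for opening a new class (illustrative)

def traverse_density(bitmap):
    """HTD/VTD: number of black runs each horizontal/vertical scan line crosses."""
    def runs_per_line(lines):
        # A run starts wherever a pixel is black and its predecessor was white.
        padded = np.pad(lines, ((0, 0), (1, 0)))
        return ((padded[:, 1:] == 1) & (padded[:, :-1] == 0)).sum(axis=1)
    htd = runs_per_line(bitmap)    # one count per row
    vtd = runs_per_line(bitmap.T)  # one count per column
    return htd, vtd

def resample(v, n=VEC_LEN):
    """Stretch/shrink a vector to length n so differently sized COs compare."""
    idx = np.linspace(0, len(v) - 1, n)
    return np.interp(idx, np.arange(len(v)), v)

def features(binary_page):
    """Label connected components and return one HTD+VTD feature per component."""
    labels, count = ndimage.label(binary_page)
    feats = []
    for i, sl in enumerate(ndimage.find_objects(labels)):
        bitmap = (labels[sl] == i + 1).astype(np.uint8)
        htd, vtd = traverse_density(bitmap)
        feats.append(np.concatenate([resample(htd), resample(vtd)]))
    return feats

def classify(feats):
    """Unsupervised: assign each CO to the nearest class, or open a new one."""
    classes, codes = [], []
    for f in feats:
        dists = [np.linalg.norm(f - c) for c in classes]
        if dists and min(dists) < NEW_CLASS_DIST:
            codes.append(int(np.argmin(dists)))
        else:
            classes.append(f)  # this CO seeds a new class
            codes.append(len(classes) - 1)
    return codes               # the document as a sequence of class codes
```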
[Figure: HTD and VTD vectors of lower case v. [1]]
[Figure: a) Original image of the document. b) COs enclosed in bounding boxes. c) Classifying COs. d) Aggregating COs of a document. [1]]
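Once every CO has a class code, a document reduces to a sequence of codes, and the n-gram step becomes ordinary sequence matching. Below is a small sketch using trigrams and Dice's coefficient; both the choice of n = 3 and of Dice as the similarity measure are my assumptions, since the paper's exact n-gram variant may differ.

```python
def ngrams(codes, n=3):
    """All consecutive n-grams of a class-code sequence."""
    return {tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)}

def dice_similarity(query_codes, doc_codes, n=3):
    """Dice's coefficient between the n-gram sets of query and document."""
    q, d = ngrams(query_codes, n), ngrams(doc_codes, n)
    if not q or not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))

def retrieve(query_codes, documents):
    """Rank documents (a dict of name -> code sequence) against the query."""
    scored = [(dice_similarity(query_codes, doc), name)
              for name, doc in documents.items()]
    return sorted(scored, reverse=True)
```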
On querying, the average recall was 47.4% while the precision was 85%. Although the results aren't exactly overwhelming, it can be said that such an approach is a pretty good start to text retrieval from imaged data without applying OCR.