Image cleaning before OCR application

I have been experimenting with PyTesser for the past couple of hours and it is a really nice tool. Couple of things I noticed about the accuracy of PyTesser:

  1. File with icons, images, and text: 5-10% accurate
  2. File with only text (images and icons erased): 50-60% accurate
  3. File with stretching (and this is the best part): stretching the file from 2. above on the x or y axis increased the accuracy by 10-20%

So apparently PyTesser does not account for font size or image stretching. Although there is much theory to be read about image processing and OCR, are there any standard image-cleanup procedures (apart from erasing icons and images) that should be performed before applying PyTesser or other libraries, irrespective of the language?
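For reference, the two cleanup steps most OCR guides recommend, grayscale conversion and global thresholding, can be sketched in dependency-free Python on images represented as nested pixel lists (a real pipeline would use Pillow or OpenCV; the function names here are illustrative):

```python
def to_grayscale(rgb_image):
    """Convert a 2-D list of (R, G, B) tuples to 8-bit luminance values
    using the standard Rec. 601 weights."""
    return [
        [int(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def binarize(gray_image, threshold=128):
    """Map every grayscale pixel to pure black (0) or white (255)."""
    return [
        [255 if p >= threshold else 0 for p in row]
        for row in gray_image
    ]
```

Binarizing before OCR removes color noise and anti-aliasing artifacts, which is typically the single biggest accuracy win short of raising the resolution.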

    ---

    Wow, this post is quite old now. I started my research on OCR again these last couple of days. This time I chucked PyTesser and used the Tesseract engine with ImageMagick instead. Coming straight to the point, this is what I found:

    1) You can increase the image resolution with ImageMagick (there are a bunch of simple shell commands you can use)
    2) After increasing the resolution, the accuracy went up to 80-90%.
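To illustrate what the upscaling step does, here is a toy nearest-neighbour resize in plain Python; ImageMagick's percentage geometry (e.g. `convert in.png -resize 300% out.png`) performs the same enlargement with better resampling filters:

```python
def upscale_nearest(image, factor):
    """Nearest-neighbour upscale of a 2-D list of grayscale pixel values.

    Each source pixel is replicated `factor` times horizontally and
    vertically, so an HxW image becomes (H*factor)x(W*factor).
    """
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in image
        for _ in range(factor)
    ]
```

Bigger glyphs give tesseract more pixels per stroke to work with, which is why resolution increases help so much.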
    

    So the Tesseract engine is without doubt the best open-source OCR engine on the market, and no prior image cleaning was required here. The caveat is that it does not work well on files with a lot of embedded images, and I couldn't figure out a way to train Tesseract to ignore them. The text layout and formatting in the image also make a big difference: it works great on images that contain just text. Hope this helps.


    Not sure whether your intent is commercial use or not, but this works wonders if you're performing OCR on a batch of similar images:

    http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

    Original:

    After pre-processing with the given arguments:


    As it turns out, the Tesseract wiki has an article that answers this question in the best way I can imagine:

  • Illustrated guide about "Improving the quality of the [OCR] output".

  • Question "image processing to improve tesseract OCR accuracy" may also be of interest.


  • (initial answer, just for the record)

    I haven't used PyTesser, but I have done some experiments with tesseract (version 3.02.02).

    If you invoke tesseract on a color image, it first applies Otsu's global thresholding method to binarize the image, and then runs the actual character recognition on the binary (black and white) result.

    Image from: http://scikit-image.org/docs/dev/auto_examples/plot_local_otsu.html

    (Otsu threshold comparison image)

    As can be seen, global Otsu may not always produce a desirable result.

    The best way to understand what tesseract 'sees' is to apply Otsu's method to your image yourself and then look at the resulting image.

    In conclusion: the most straightforward way to improve the recognition rate is to binarize the images yourself (most likely you will have to find a good threshold by trial and error) and then pass those binarized images to tesseract.

    Somebody was kind enough to publish the API docs for tesseract, so it is possible to verify the previous statements about the processing pipeline: ProcessPage -> GetThresholdedImage -> ThresholdToPix -> OtsuThresholdRectToPix
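For the record, Otsu's method picks the threshold that maximizes the between-class variance of the grayscale histogram. A plain-Python sketch of the algorithm (operating on a flat list of 8-bit pixel values, purely for illustration):

```python
def otsu_threshold(pixels):
    """Compute Otsu's global threshold for a list of 8-bit grayscale values."""
    # Build a 256-bin intensity histogram.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    sum_bg = 0.0        # weighted sum of the background class
    weight_bg = 0       # pixel count of the background class
    best_thresh, best_var = 0, -1.0

    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance; Otsu maximizes this quantity.
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var = var_between
            best_thresh = t
    return best_thresh
```

Running this on your own image and comparing the result against tesseract's output is a quick way to see whether the global threshold is what's hurting recognition.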


    I know it's not a perfect answer, but I'd like to share a video I saw from PyCon 2013 that might be applicable. It's a little light on implementation details, but it might give you some guidance/inspiration on how to solve/improve your problem.

    Link to Video

    Link to Presentation

    And if you do decide to use ImageMagick to pre-process your source images a little, here is a question that points you to nice Python bindings for it.

    On a side note, quite an important thing with Tesseract: you need to train it, otherwise it won't be nearly as good/accurate as it's capable of being.
