Extract Text from a multi-page PDF with only Images
Sometimes a PDF contains only images, for example scanned pages. In such cases you cannot select the text to copy and paste it, or even just to refer to it.
To extract text from an image or from a PDF containing only images, I used the Tesseract OCR engine and Ghostscript. I am running Fedora 19 at the moment, but these steps should also apply to older versions of Fedora and to Ubuntu (I believe this can be done on Windows as well). Both Tesseract and Ghostscript are free software.
First, install both Tesseract and Ghostscript on Fedora:
$ sudo yum install -y ghostscript tesseract
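Before going further, it can help to confirm that both tools are actually on your PATH. The following is only a minimal sketch: the function name missing_tools is made up for this example, and it checks for the ghostscript command name used in this post (on some systems the binary may only be installed as gs).

```shell
# Print the name of every given command that is not found on the PATH.
# missing_tools is a hypothetical helper, not part of either package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Prints nothing when both tools are installed.
missing_tools ghostscript tesseract
```

If a name is printed, that tool still needs to be installed, or it lives under a different command name.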
Now go to the folder where your PDF is located ( assuming that it is named as
$ cd ~/Downloads/
Next, extract each page from PDF as a PNG. For this I used Ghostscript. Note the resolution ( -r300 ):
$ ghostscript -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile="page%03d".png story.pdf
$ ls page*.png
page001.png page002.png ...
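To get a feel for what -r300 means: PDF page sizes are measured in points, 72 to an inch, so a length in pixels is points × dpi / 72. Here is a small sketch of that arithmetic, assuming a US Letter page (612 × 792 points). Your PDF may use a different page size, and the helper name pt_to_px is hypothetical.

```shell
# Convert a length in PDF points (72 per inch) to pixels at a given resolution.
# pt_to_px is an illustrative helper; integer arithmetic truncates any fraction.
pt_to_px() {
    echo $(( $1 * $2 / 72 ))
}

# US Letter is 612 x 792 points (an assumption; your PDF may differ).
echo "$(pt_to_px 612 300) x $(pt_to_px 792 300) pixels at 300 dpi"
```

For a Letter page this prints 2550 x 3300 pixels at 300 dpi. A lower resolution gives smaller files but leaves fewer pixels per character for the OCR step.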
Once we have a PNG for each page, we can use the OCR software to extract text:
$ for f in page*.png ; do tesseract "$f" "$f.out"; done
$ ls page*.out.txt
page001.png.out.txt page002.png.out.txt ...
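If one text file per page is inconvenient, the per-page outputs can be joined into a single file. This is a minimal sketch rather than part of the steps I ran: combine_pages and the output name story.txt are made up for the example. It relies on the zero-padded page numbers so that the shell's sorted glob matches the page order, which holds for up to 999 pages with the %03d pattern.

```shell
# Concatenate all per-page OCR outputs in the current directory into one file.
# combine_pages is a hypothetical helper; $1 is the name of the combined file.
combine_pages() {
    # The shell expands the glob in sorted order, and the zero-padded
    # numbers (page001, page002, ...) make that the page order.
    cat page*.png.out.txt > "$1"
}

# Example: combine_pages story.txt
```

Choose an output name that does not itself match page*.png.out.txt, otherwise the combined file would be picked up by the glob.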
So now all the text from the images is in text files. Tesseract does the OCR quite well. Obviously it cannot read drawings or misprinted characters reliably, but it is still quite accurate.
I hope it is helpful for you.
- How to install and use tesseract
- How to convert a multi-page PDF file to PNG files, with one PNG file per page of the PDF document