How to turn a pdf into a text searchable pdf?

I have a number of scanned documents in pdf and I want to be able to search them. How can I do that?

Essentially I have to OCR the pdf and then blend the extracted text back into a new pdf. I have unsuccessfully tried a number of different solutions (including the ones found in Adding OCR info to a PDF).

  1. pdfocr (which gives me this issue: https://github.com/gkovacs/pdfocr/issues/7)
  2. pdfsandwich (of which the software center says it is a poor package and I should not install it)
  3. OCRfeeder (in the software center) exports to odt nicely, but does not react when exporting to pdf.
  4. Gscan2pdf exports an all black (but searchable) image as reported in this discussion.
  5. I don’t think Pdfxchange viewer can handle doing ocr on the fly on files over 500 pages.

Is there a software package I am unaware of? Or a script that does this?

Solutions:

Several solutions are listed below. Solution 1 is the recommended starting point.

Solution 1

As of Ubuntu 16.04, OCRmyPDF is available through apt. Just run:

sudo apt install ocrmypdf
ocrmypdf -h   # to see the usage

Finally you can OCR your pdf with the command:

ocrmypdf input.pdf output.pdf  # change input and output to the files you want

If it seems the command is unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter pdf. You can shorten a pdf as follows:

pdftk A=input.pdf cat A1-5 output output.pdf
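One way to judge the short sample is to extract its text layer and see whether anything comes back. The sketch below assumes pdftotext (from the poppler-utils package) is installed; the helper name has_text is made up for this example and is not part of any tool mentioned here.

```shell
#!/bin/sh
# has_text: succeed if stdin contains at least one letter or digit.
# (Hypothetical helper name, used only for this illustration.)
has_text() {
    grep -q '[[:alnum:]]'
}

# Usage, assuming pdftotext from poppler-utils is installed:
#   pdftotext output.pdf - | has_text && echo "text layer found"
```

An image-only PDF typically yields nothing but page breaks from pdftotext, so the check fails for it. It says nothing about how accurate the recognized text is; for that, search the output for a word you know is on the page.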

If you have any questions, have a look at the GitHub repo.
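Since the question is about a number of scanned documents, the single command can be wrapped in a loop. This is a minimal sketch, not something from the OCRmyPDF documentation: the function name ocr_all, the directory layout, and the OCR_CMD override (handy for a dry run with another command) are all assumptions.

```shell
#!/bin/sh
# ocr_all SRC_DIR DEST_DIR
# Run OCR on every *.pdf in SRC_DIR and write the results to DEST_DIR
# under the same file names. OCR_CMD is a hypothetical override; when
# it is unset the loop calls ocrmypdf as "ocrmypdf input.pdf output.pdf".
ocr_all() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue            # the glob matched nothing
        out="$dest/$(basename "$f")"
        if [ -e "$out" ]; then
            echo "skipping $f (output already exists)"
            continue
        fi
        "${OCR_CMD:-ocrmypdf}" "$f" "$out" || echo "failed: $f" >&2
    done
}

# Usage:
#   ocr_all ~/scans ~/scans-searchable
```

Skipping files whose output already exists means an interrupted run can simply be restarted.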

Solution 2

@don.joey answered with the ocrmypdf script. However, it can be installed directly now (from 16.10 onwards).

sudo apt install ocrmypdf

Then you have to install the tesseract languages you need.

To list which languages are already in your system, type:

tesseract --list-langs

If one is missing, install it. For instance:

sudo apt install tesseract-ocr-spa

Now you can produce a searchable PDF (whose quality will vary, depending on the scanned document) with the following command:

ocrmypdf -l 'spa' old.pdf new.pdf

You can, of course, check its man page for some additional options.
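The language check and the OCR call can be combined so that a missing language pack is caught before a long run starts. The following is a sketch only; lang_listed is a made-up helper name, and it simply looks for the language code as a whole line in whatever it reads on stdin.

```shell
#!/bin/sh
# lang_listed LANG: succeed if LANG appears as a whole line on stdin.
# Meant to be fed the output of `tesseract --list-langs`.
# (Hypothetical helper name, used only for this illustration.)
lang_listed() {
    grep -qxF -- "$1"
}

# Usage (2>&1 because some tesseract versions print the list on stderr):
#   if tesseract --list-langs 2>&1 | lang_listed spa; then
#       ocrmypdf -l spa old.pdf new.pdf
#   else
#       echo "install tesseract-ocr-spa first" >&2
#   fi
```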

Solution 3

pdfsandwich performs exactly this job. I wasn’t aware that there is a package provided in the software center, but I’m providing Ubuntu deb packages for it on the project website (see http://www.tobias-elze.de/pdfsandwich/ for details), including the currently most recent version (0.1.2), which is unlikely to be in any software center yet.

If you have a scanned file scanned_file.pdf, simply call

pdfsandwich scanned_file.pdf

which generates the file scanned_file_ocr.pdf with the recognized text added to the scanned pages.

Compared to most existing solutions, it autodetects the tesseract version installed and adapts its behavior accordingly. In addition, it performs preprocessing of the scanned images prior to the OCR process, such as de-skewing or removal of dark edges etc., which can considerably improve optical character recognition.

DISCLAIMER: I’m the developer of pdfsandwich and therefore heavily biased.

Solution 4

I had this same problem so I wrote this over the weekend. Give it a shot; it works great! It is a simple wrapper around tesseract. It uses pdftoppm to convert a PDF into a bunch of TIFF files, then it uses tesseract to perform OCR (Optical Character Recognition) on them and produce a searchable PDF as output. All intermediate temporary files are automatically deleted when the script completes.

Source code: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF

Instructions to install & use pdf2searchablepdf:

Tested on Ubuntu 18.04 on 11 Nov 2019.

Install:

git clone https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.git
./PDF2SearchablePDF/install.sh
sudo apt update
sudo apt install tesseract-ocr

Use:

# General:
pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]

# Make a PDF searchable:
pdf2searchablepdf mypdf.pdf

# Make an entire directory of images into a single searchable PDF:
pdf2searchablepdf directory_of_imgs

You’ll now have a pdf called mypdf_searchable.pdf, which contains searchable text!

Done. The wrapper has no python dependencies, as it’s currently written entirely in bash.
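For readers curious about what such a wrapper does, the pdftoppm and tesseract steps described above can be sketched roughly as follows. This is NOT the code of pdf2searchablepdf; the function name pdf_ocr_sketch, the 300 dpi resolution, and the temporary-directory layout are assumptions made for illustration. It relies on pdftoppm (poppler-utils) and on tesseract accepting a text file that lists one image path per line.

```shell
#!/bin/sh
# pdf_ocr_sketch INPUT.pdf [LANG]
# Illustrative only: render pages to TIFF, OCR them, clean up.
pdf_ocr_sketch() {
    in=$1
    lang=${2:-eng}
    if [ -z "$in" ] || [ ! -f "$in" ]; then
        echo "usage: pdf_ocr_sketch INPUT.pdf [LANG]" >&2
        return 2
    fi
    base=${in%.pdf}
    work=$(mktemp -d) || return 1
    status=0

    # 1. Render each page to a TIFF named pg-<number>. pdftoppm
    #    zero-pads the numbers, so a glob lists the pages in order.
    pdftoppm -tiff -r 300 "$in" "$work/pg" || status=$?

    if [ "$status" -eq 0 ]; then
        # 2. Write the page images to a list file, one path per line.
        printf '%s\n' "$work"/pg-* > "$work/pages.txt"
        # 3. OCR all listed pages into <base>_searchable.pdf.
        tesseract -l "$lang" "$work/pages.txt" "${base}_searchable" pdf ||
            status=$?
    fi

    rm -rf "$work"    # remove the intermediate files
    return "$status"
}

# Usage:
#   pdf_ocr_sketch mypdf.pdf        # writes mypdf_searchable.pdf
```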

References or Related Resources:

  1. PDF2SearchablePDF: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
  2. How to turn a pdf into a text searchable pdf?
  3. What's the best, simplest OCR solution?
  4. Extracting embedded images from a PDF
  5. pdfsandwich: Alternative software wrapper I just discovered, that is worth checking out too! http://www.tobias-elze.de/pdfsandwich/
  6. https://unix.stackexchange.com/questions/301318/how-to-ocr-a-pdf-file-and-get-the-text-stored-within-pdf/551526#551526
  7. [how to turn a PDF into a bunch of images with pdftoppm] Extracting embedded images from a PDF

Solution 5

OS: Ubuntu 18.04

First, install tesseract-ocr with:

apt-cache show tesseract-ocr   # optional: inspect the package first
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install tesseract-ocr

If you are going to use a language other than English with tesseract, you will have to install the corresponding language package. For example, for Portuguese:

sudo apt-get install tesseract-ocr-por

Otherwise you’ll get the error:

Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/por.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory.
Failed loading language 'por'
Tesseract couldn't load any languages!
Could not initialize tesseract.

If you Google “tesseract PDF” you will probably find this somewhat outdated post. However, it gives you some useful hints. You will first have to convert your .pdf file to a .tiff one. Run:

convert -density 125 originalfile.pdf -depth 8 -alpha Off newfile.tiff

If, as in the outdated post, you forget to add -alpha Off, you’ll get the following error:

Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Error in pixReadFromTiffStream: spp not in set {1,3,4}

Now you can run the final command. In the particular case that your original PDF is in Portuguese, you will need this command:

tesseract -l por newfile.tiff output pdf 

The generated file will be named output.pdf . If, for example, your PDF is in French, after you install the corresponding tesseract-ocr-fra , you will run:

tesseract -l fra newfile.tiff output pdf 

And the desired file will be, again, output.pdf .
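The two commands can be chained in a small function so that the language is passed once and the intermediate TIFF is cleaned up. This is a sketch under assumptions: the function name tiff_ocr and the default language eng are not from the original answer, and convert is the ImageMagick command used above.

```shell
#!/bin/sh
# tiff_ocr INPUT.pdf OUTPUT_BASE [LANG]
# Convert the PDF to a multi-page TIFF, then OCR it into OUTPUT_BASE.pdf.
# (Hypothetical helper; name and default language are assumptions.)
tiff_ocr() {
    if [ "$#" -lt 2 ]; then
        echo "usage: tiff_ocr INPUT.pdf OUTPUT_BASE [LANG]" >&2
        return 2
    fi
    in=$1
    out=$2
    lang=${3:-eng}
    tmpdir=$(mktemp -d) || return 1
    tiff="$tmpdir/pages.tiff"          # intermediate file

    # Same options as the commands above; -alpha Off avoids the
    # "spp not in set {1,3,4}" error.
    convert -density 125 "$in" -depth 8 -alpha Off "$tiff" &&
        tesseract -l "$lang" "$tiff" "$out" pdf
    status=$?

    rm -rf "$tmpdir"
    return "$status"
}

# Usage:
#   tiff_ocr originalfile.pdf output por    # writes output.pdf
```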

Solution 6

OCRfeeder has a bug in

/usr/lib/python2.7/dist-packages/reportlab/pdfgen/textobject.py

Line 436 should read:

            lines = asUnicode(stuff).strip().split('\n')
# bug here, was:
#            lines = '\n'.split(asUnicode(stuff).strip())

I changed this and it worked for me.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
