Checking whether a PDF file is searchable
- Written by Matthew
Some PDFs are searchable; others are not. Given a collection of PDF documents, this script sorts each one into one of two folders. Documents whose text is searchable are placed in one folder, and documents that do not (yet) contain searchable text are placed in the other.
The download below is a so-called "batch sequence". Batch sequences offer an easy way of carrying out a series of actions on PDF documents in Adobe Acrobat. They are stored in separate text files with the extension .sequ. The Help in your version of Acrobat tells you where they should be copied to and how to run them.
Requirements: Adobe Acrobat (Acrobat Reader is not sufficient).
Someone had a problem with "a huge project" that required PDF files to be OCR-ed to make the text in them searchable. Some of the files, however, had already been OCR-ed, and he wanted to find those files that had not yet been processed.
A PDF document may contain text in either of two ways. If the text is present only as an image of the page, it cannot be searched. Text can be searched only if it is stored in the file as actual text. OCR (optical character recognition) adds a text layer to a PDF whose words exist only as an image. Thus, a file that has been OCR-ed will contain words, and one that hasn't will not. Our batch sequence simply uses the word count of each document to divide the collection into two groups. Documents with zero words have not yet been OCR-ed and are placed in one folder. Documents with one or more words have already been OCR-ed and are placed in another folder.
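The word-count test can be sketched in a few lines of JavaScript, the language batch sequences are written in. This is a minimal illustration, not the code of the downloadable sequence: it assumes a document object that exposes numPages and getPageNumWords(pageIndex), as the Acrobat JavaScript Doc object does. The folder names and the fakeDoc helper are hypothetical and exist only so the logic can be tried outside Acrobat.

```javascript
// Sketch only. "doc" is any object with numPages and getPageNumWords(pageIndex);
// inside an Acrobat batch sequence that role is played by the current document.

// Add up the words found on every page of the document.
function countWords(doc) {
  var total = 0;
  for (var page = 0; page < doc.numPages; page++) {
    total += doc.getPageNumWords(page);
  }
  return total;
}

// Decide which folder the document belongs in (folder names are hypothetical).
function chooseFolder(doc) {
  return countWords(doc) > 0 ? "searchable" : "needs-ocr";
}

// Hypothetical stand-in for a real document: one word count per page.
function fakeDoc(wordsPerPage) {
  return {
    numPages: wordsPerPage.length,
    getPageNumWords: function (page) { return wordsPerPage[page]; }
  };
}

var scanned = fakeDoc([0, 0, 0]);   // image-only pages, no text layer
var ocred = fakeDoc([0, 120, 95]);  // blank first page, then OCR-ed text

console.log(chooseFolder(scanned)); // needs-ocr
console.log(chooseFolder(ocred));   // searchable
```

Saving each document into the chosen folder is left out here, because that step depends on the paths and security settings of the Acrobat installation.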
(Thanks to Michael J Evering II for observing that we can stop checking after finding the first page which contains words.)
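A rough sketch of that early exit follows, again assuming only a document object with numPages and getPageNumWords(pageIndex). The countingDoc helper is hypothetical; it records how many pages were inspected so the saving is visible.

```javascript
// Sketch only: return as soon as one page is found to contain words.
function hasWords(doc) {
  for (var page = 0; page < doc.numPages; page++) {
    if (doc.getPageNumWords(page) > 0) {
      return true; // the first page with words settles the question
    }
  }
  return false; // every page was checked and none contained words
}

// Hypothetical stand-in for a real document that counts how often it is queried.
function countingDoc(wordsPerPage) {
  var doc = {
    numPages: wordsPerPage.length,
    calls: 0,
    getPageNumWords: function (page) {
      doc.calls++;
      return wordsPerPage[page];
    }
  };
  return doc;
}

var sample = countingDoc([0, 42, 17, 80]);
console.log(hasWords(sample), sample.calls); // true 2
```

For a long document that has already been OCR-ed, this usually means inspecting one or two pages instead of all of them. Only documents with no words at all still require a full pass.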