pdfsandwich (add OCR text layer to scanned PDF files)

pdfsandwich generates "sandwich" OCR PDF files: PDF files that contain
only images (no text) are processed by optical character recognition
(OCR), and the recognized text is added to each page invisibly
"behind" the images. This makes it possible to search for text in
the PDF and to copy text from it.

Notes:

--
The man page explains this, but I'll mention it here: the "sandwich"
PDF (the output) for filename.pdf will be written to filename_ocr.pdf.
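
If you script around pdfsandwich, the naming rule above can be sketched
as a small shell helper. This is only an illustration: the function
names ocr_output_name and ocr_if_needed are made up for this README and
are not part of pdfsandwich, and the sketch assumes the input name ends
in ".pdf" and that no -o option is given.

```shell
# Hypothetical helper (not part of pdfsandwich): print the default
# output name for an input file, i.e. filename.pdf -> filename_ocr.pdf.
ocr_output_name() {
    printf '%s\n' "${1%.pdf}_ocr.pdf"
}

# Hypothetical wrapper: run pdfsandwich on a file only when its
# "sandwich" output does not exist yet.
ocr_if_needed() {
    in=$1
    out=$(ocr_output_name "$in")
    if [ -e "$out" ]; then
        echo "skipping $in: $out already exists"
    else
        pdfsandwich "$in"
    fi
}
```

For example, "ocr_if_needed scan.pdf" would run pdfsandwich once and
skip the file on later runs, as long as scan_ocr.pdf is still there.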

--
According to its man page, pdfsandwich can optionally use hocr2pdf.
However, the version of hocr2pdf on SlackBuilds.org (in the
exact-image package) doesn't seem to work correctly with pdfsandwich.
This isn't a real problem; just don't use the -enforcehocr2pdf option
with pdfsandwich.

--
The PDFs created by pdfsandwich are not quite to spec. In mupdf, you
may see "warning: broken xref subsection, proceeding anyway". This
seems to be harmless. If you discover that it isn't harmless, you
can use ghostscript to fix it, thus:

$ gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
     -sOutputFile=filename_fixed.pdf filename_ocr.pdf