pdfsandwich (add OCR text layer to scanned PDF files)

pdfsandwich generates "sandwich" OCR PDF files: PDF files which contain
only images (no text) are processed by optical character recognition
(OCR), and the recognized text is added to each page invisibly "behind"
the images. This makes it possible to search for text in the PDF, and
to copy text from the PDF.

Notes:

--
The man page explains this, but I'll mention it here: the "sandwich"
PDF (the output) for filename.pdf will be written to filename_ocr.pdf.
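
As a rough sketch of that naming rule: ocr_name below is a hypothetical
helper (not part of pdfsandwich), and it assumes the input name ends in
a lowercase ".pdf". The pdfsandwich call is guarded so that the snippet
does nothing if the program or the input file is missing.

```shell
# Hypothetical helper: derive the default output name described above,
# i.e. filename.pdf -> filename_ocr.pdf (assumes a lowercase .pdf suffix).
ocr_name() {
    # Strip a trailing ".pdf" and append "_ocr.pdf".
    printf '%s\n' "${1%.pdf}_ocr.pdf"
}

# Usage sketch: run pdfsandwich only if it is installed and the input
# exists, then report where the result is expected to be.
in=filename.pdf
if command -v pdfsandwich >/dev/null 2>&1 && [ -f "$in" ]; then
    pdfsandwich "$in" && echo "output: $(ocr_name "$in")"
fi
```

The helper only predicts the name; it doesn't check that pdfsandwich
actually produced the file.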

--
According to its man page, pdfsandwich can optionally use hocr2pdf.
However, the version of hocr2pdf on SlackBuilds.org (in the
exact-image package) doesn't seem to work correctly with pdfsandwich.
This isn't a real problem: just don't use the -enforcehocr2pdf option
with pdfsandwich.

--
The PDFs created by pdfsandwich are not quite to spec. In mupdf, you
may see "warning: broken xref subsection, proceeding anyway". This
seems to be harmless. If you discover that it isn't harmless, you
can use ghostscript to fix it, thus:

  $ gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
       -sOutputFile=pdf_fixed.pdf pdf_ocr.pdf