Pdfminer.six is a tool for extracting information from PDF documents. It
focuses on getting and analyzing text data. Pdfminer.six extracts the
text of a page directly from the source code of the PDF. It can also be
used to get the exact location, font or color of the text.

It is built in a modular way such that each component of pdfminer.six
can be replaced easily. You can implement your own interpreter or
rendering device that uses the power of pdfminer.six for other purposes
than text analysis.

Features:

* Written entirely in Python.
* Parse, analyze, and convert PDF documents.
* Extract content as text, images, HTML or hOCR.
* PDF-1.7 specification support (well, almost).
* CJK languages and vertical writing scripts support.
* Various font types (Type1, TrueType, Type3, and CID) support.
* Support for extracting images (JPG, JBIG2, Bitmaps).
* Support for various compressions (ASCIIHexDecode, ASCII85Decode,
  LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
* Support for RC4 and AES encryption.
* Support for AcroForm interactive form extraction.
* Table of contents extraction.
* Tagged contents extraction.
* Automatic layout analysis.

Pdfminer.six comes with two handy tools: pdf2txt.py and dumppdf.py.

The pdf2txt.py tool extracts all the text from a PDF. It uses layout
analysis with sensible defaults to order and group the text in a
readable way.

The dumppdf.py tool can be used to extract the internal structure from a
PDF. This tool is primarily for debugging purposes, but it can be
useful to anybody working with PDFs.