PDF2TXT-OCR enhances scanned PDF files by adding an OCR (Optical Character Recognition) text layer, making them searchable and editable.
- Converts PDFs into searchable PDF/A files.
- Accurately places OCR text beneath images for easy copy-pasting.
- Maintains the original image resolution while optimizing file size.
- Deskews and cleans images for improved OCR accuracy (if requested).
- Validates both input and output files for compatibility.
- Distributes tasks across all CPU cores for faster processing.
- Supports Tesseract OCR for over 100 languages.
- Handles files with thousands of pages efficiently.
- Built for privacy, ensuring your data remains secure.
PDF2TXT-OCR was created after a survey of existing tools turned up the same problems again and again: misplaced OCR text, altered image resolution, oversized output files, and no PDF/A support. This tool is designed to close those gaps.
PDF2TXT-OCR is compatible with Linux, Windows, macOS, and FreeBSD. Docker images for x64 and ARM are also available.
| Platform | Command |
|---|---|
| Debian/Ubuntu | apt install ocrmypdf |
| Fedora | dnf install ocrmypdf |
| macOS (Homebrew) | brew install ocrmypdf |
| macOS (MacPorts) | port install ocrmypdf |
| FreeBSD | pkg install py-ocrmypdf |
| Snap Package | snap install ocrmypdf |
For other operating systems, refer to the detailed installation documentation.
PDF2TXT-OCR is a command-line tool. Here’s a quick example:
ocrmypdf input_scanned.pdf output_searchable.pdf
- -l eng+fra: Specifies the languages for OCR.
- --deskew: Straightens crooked pages.
- --rotate-pages: Corrects pages that are rotated the wrong way.
- --output-type pdfa: Produces PDF/A output (this is the default).
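When the tool is driven from a script, the options above can be assembled into an argument list before the command runs. The following is a minimal Python sketch under stated assumptions: `build_ocrmypdf_command` and `run_ocr` are hypothetical helper names, not part of the tool, and only the flags already shown in this README are used.

```python
import subprocess
from typing import List, Optional, Sequence


def build_ocrmypdf_command(
    input_path: str,
    output_path: str,
    languages: Optional[Sequence[str]] = None,
    deskew: bool = False,
    rotate_pages: bool = False,
    output_type: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for the ocrmypdf CLI (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if languages:
        # Multiple languages are joined with '+', e.g. eng+fra.
        cmd += ["-l", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")
    if rotate_pages:
        cmd.append("--rotate-pages")
    if output_type:
        cmd += ["--output-type", output_type]
    # Input and output paths come last, as in the examples in this README.
    cmd += [input_path, output_path]
    return cmd


def run_ocr(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs ocrmypdf on PATH)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    command = build_ocrmypdf_command(
        "input_scanned.pdf",
        "output_searchable.pdf",
        languages=["eng", "fra"],
        deskew=True,
        rotate_pages=True,
        output_type="pdfa",
    )
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.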
For a full list of options, use:
ocrmypdf --help
To use OCR for different languages, install Tesseract language packs. For instance:
# Debian/Ubuntu
apt-get install tesseract-ocr-chi-sim # Install Chinese (Simplified)
# macOS (Homebrew)
brew install tesseract-lang
Specify multiple languages by joining their ISO 639-3 codes with +, e.g., -l eng+fra.
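Before passing a language list to -l, it can help to confirm that the matching Tesseract packs are actually installed. The sketch below is one way to do that in Python. It assumes that `tesseract --list-langs` prints a header line followed by one language code per line. The function names are hypothetical.

```python
import subprocess
from typing import List, Set


def parse_list_langs(output: str) -> Set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumes the first non-empty line is a header and every following
    non-empty line is a single language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return set(lines[1:])


def installed_languages() -> Set[str]:
    """Ask the local Tesseract which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr in case this build prints the list there.
    return parse_list_langs(result.stdout or result.stderr)


def language_argument(wanted: List[str], available: Set[str]) -> str:
    """Return the value for -l, or raise if a language pack is missing."""
    missing = [code for code in wanted if code not in available]
    if missing:
        raise ValueError("missing Tesseract language packs: " + ", ".join(missing))
    return "+".join(wanted)
```

For example, `language_argument(["eng", "fra"], installed_languages())` yields `eng+fra` when both packs are present and raises a `ValueError` naming the missing pack otherwise.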
- Convert an image into a searchable PDF:
ocrmypdf input.jpg output.pdf
- Add OCR to an existing PDF:
ocrmypdf input.pdf output.pdf
- Add OCR to a PDF that mixes English and Spanish:
ocrmypdf -l eng+spa input.pdf output.pdf
Check out the full documentation for additional examples.
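The single-file examples above extend naturally to a folder of scans. The following Python sketch plans one command per input file. It is illustrative only: `plan_outputs` and `batch_commands` are hypothetical names, and each output is assumed to be a PDF named after its input and written to a separate folder.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def plan_outputs(inputs: Iterable[Path], output_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair each input (image or PDF) with a .pdf path inside output_dir."""
    pairs: List[Tuple[Path, Path]] = []
    seen = set()
    for src in inputs:
        dst = output_dir / src.with_suffix(".pdf").name
        if dst in seen:
            # e.g. scan.jpg and scan.pdf would both map to scan.pdf
            raise ValueError(f"two inputs map to the same output: {dst}")
        seen.add(dst)
        pairs.append((src, dst))
    return pairs


def batch_commands(
    pairs: Iterable[Tuple[Path, Path]], languages: str = "eng"
) -> List[List[str]]:
    """Build one ocrmypdf argv list per (input, output) pair."""
    return [["ocrmypdf", "-l", languages, str(src), str(dst)] for src, dst in pairs]
```

A caller could collect inputs with `sorted(Path("scans").glob("*.pdf"))` and hand each resulting command to `subprocess.run`.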
- Python (compatible versions are listed on PyPI).
- Tesseract OCR (v4.1.1 or higher).
- Ghostscript for PDF processing.
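The requirements above can be checked from a script before a long OCR job starts. This is a rough Python sketch, not the tool's own dependency check. It assumes the executables are named `tesseract` and `gs` (typical on Linux and macOS, possibly different on Windows) and that the first line of `tesseract --version` contains a dotted version number.

```python
import re
import shutil
from typing import Dict, Optional, Tuple

MIN_TESSERACT = (4, 1, 1)  # minimum version stated in the requirements list


def parse_tesseract_version(first_line: str) -> Optional[Tuple[int, ...]]:
    """Pull a (major, minor, patch) tuple out of a version line."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", first_line)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def tesseract_is_new_enough(first_line: str) -> bool:
    """Compare the parsed version against MIN_TESSERACT."""
    version = parse_tesseract_version(first_line)
    return version is not None and version >= MIN_TESSERACT


def find_dependencies() -> Dict[str, Optional[str]]:
    """Locate the external programs on PATH; None means not found."""
    return {name: shutil.which(name) for name in ("tesseract", "gs")}
```

Comparing versions as integer tuples avoids the string-comparison trap in which "4.10.0" sorts before "4.9.0".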
- Install: Make sure PDF2TXT-OCR is installed on your system.
- Basic Command:
  ocrmypdf input.pdf output.pdf
- Optional Features:
  - Straighten pages: --deskew
  - Fix alignment: --rotate-pages
  - Set languages: -l eng+fra
  - PDF/A output: --output-type pdfa
- Examples:
  - Convert image to PDF:
    ocrmypdf input.jpg output.pdf
  - Add OCR to multilingual PDF:
    ocrmypdf -l eng+spa input.pdf output.pdf
- Help: For all options, run:
  ocrmypdf --help
PDF2TXT-OCR has been featured in leading publications, including:
- Medium: Going Paperless with OCRmyPDF
- Linux Links: Excellent Utilities: OCRmyPDF
- c’t Magazine: Detailed overview in Germany’s top IT magazine.
For feature development, consulting, or integrating PDF2TXT-OCR into larger systems, please get in touch. Support from companies and users helps improve the project.
PDF2TXT-OCR is licensed under the Mozilla Public License 2.0. Other components may use different licenses, as detailed in the source code.
This software is provided "AS IS" without warranties of any kind, either express or implied.
For further details, visit our documentation or report issues on GitHub.