PDF2TXT-OCR: Add OCR Text Layer to PDFs

PDF2TXT-OCR enhances scanned PDF files by adding an OCR (Optical Character Recognition) text layer, making them searchable and editable.

Features at a Glance

Converts PDFs into searchable PDF/A files.
Accurately places OCR text beneath images for easy copy-pasting.
Maintains the original image resolution while optimizing file size.
Deskews and cleans images for improved OCR accuracy (if requested).
Validates both input and output files for compatibility.
Distributes tasks across all CPU cores for faster processing.
Uses Tesseract OCR, which recognizes over 100 languages.
Handles files with thousands of pages efficiently.
Built for privacy, ensuring your data remains secure.

Why Choose PDF2TXT-OCR?

Existing tools showed recurring problems: misplaced OCR text, altered image resolution, oversized output files, and no PDF/A support. PDF2TXT-OCR was created to close these gaps.

Installation Guide

PDF2TXT-OCR is compatible with Linux, Windows, macOS, and FreeBSD. Docker images for x64 and ARM are also available.

Quick Installation

| Platform | Command |
| --- | --- |
| Debian/Ubuntu | `apt install ocrmypdf` |
| Fedora | `dnf install ocrmypdf` |
| macOS (Homebrew) | `brew install ocrmypdf` |
| macOS (MacPorts) | `port install ocrmypdf` |
| FreeBSD | `pkg install py-ocrmypdf` |
| Snap Package | `snap install ocrmypdf` |

For other operating systems, refer to the detailed installation documentation.

Getting Started

PDF2TXT-OCR is a command-line tool, invoked as `ocrmypdf`. Here’s a quick example:

```
ocrmypdf input_scanned.pdf output_searchable.pdf
```

Key Options:

- `-l eng+fra`: Sets the OCR languages (here English and French).
- `--deskew`: Straightens skewed pages.
- `--rotate-pages`: Rotates pages that are in the wrong orientation.
- `--output-type pdfa`: Produces a PDF/A file. This is the default output type.
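
The options above can be combined in a single run. The sketch below wraps them in a small shell function. `ocr_scan` and the `DRY_RUN` variable are hypothetical names introduced here for illustration, not part of PDF2TXT-OCR. It assumes a POSIX shell and uses only the flags already listed.

```shell
#!/bin/sh
# ocr_scan (hypothetical helper): run OCR with the key options combined.
#   $1 = input PDF, $2 = output PDF, $3 = language list (defaults to eng)
# Set DRY_RUN=1 to print the command instead of running it.
ocr_scan() {
    langs="${3:-eng}"
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty.
    ${DRY_RUN:+echo} ocrmypdf -l "$langs" --deskew --rotate-pages \
        --output-type pdfa "$1" "$2"
}
```

Calling `DRY_RUN=1 ocr_scan scan.pdf scan_ocr.pdf eng+fra` prints the full command, so it can be checked before a long run.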

For a full list of options, use:

```
ocrmypdf --help
```

Multilingual OCR

To use OCR for different languages, install Tesseract language packs. For instance:

```
# Debian/Ubuntu
apt-get install tesseract-ocr-chi-sim  # Install Chinese (Simplified)

# macOS (Homebrew)
brew install tesseract-lang
```

Specify multiple languages by joining their Tesseract language codes (three-letter codes, mostly ISO 639-3) with `+`, e.g., `-l eng+fra`.
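
Before a multilingual run, it can help to confirm that every requested language pack is installed. The sketch below compares a `+`-joined language list against the output of `tesseract --list-langs`. The function name `missing_langs` is hypothetical, and the example assumes Tesseract 4 or later, which prints the list on standard output.

```shell
#!/bin/sh
# missing_langs (hypothetical helper): print every code from a "+"-joined
# list ($1) that is absent from a newline-separated list of installed
# languages ($2).
missing_langs() {
    printf '%s\n' "$1" | tr '+' '\n' | while IFS= read -r lang; do
        # -x matches whole lines only, -F treats the code as a literal string
        printf '%s\n' "$2" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
    done
}

# Only query Tesseract when it is actually on PATH.
if command -v tesseract >/dev/null 2>&1; then
    installed=$(tesseract --list-langs 2>/dev/null)
    missing=$(missing_langs "eng+fra" "$installed")
    [ -z "$missing" ] || echo "Missing Tesseract language packs: $missing"
fi
```

The header line that `tesseract --list-langs` prints never matches a language code, so it does not need to be stripped first.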

Advanced Features

Convert an image into a searchable PDF:
```
ocrmypdf input.jpg output.pdf
```
Add OCR to an existing PDF:
```
ocrmypdf input.pdf output.pdf
```

Add OCR to a PDF that mixes languages (here English and Spanish):

```
ocrmypdf -l eng+spa input.pdf output.pdf
```

Check out the full documentation for additional examples.
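
The single-file commands above extend naturally to a whole folder of scans. The following is a minimal sketch, assuming a POSIX shell. `ocr_dir` and `DRY_RUN` are hypothetical names, and the loop only uses the basic `ocrmypdf input output` form shown earlier.

```shell
#!/bin/sh
# ocr_dir (hypothetical helper): OCR every *.pdf in directory $1 and write
# the results, under the same file names, into directory $2.
# Set DRY_RUN=1 to print the commands instead of running them.
ocr_dir() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"              # created even in dry-run mode
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue   # no match: the glob pattern stays literal
        ${DRY_RUN:+echo} ocrmypdf "$f" "$dest/$(basename "$f")"
    done
}
```

Writing to a separate output directory keeps the original scans untouched, so a failed run can simply be repeated.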

Requirements

Python (compatible versions are listed on PyPI).
Tesseract OCR (v4.1.1 or higher).
Ghostscript for PDF processing.
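
A quick way to see whether these requirements are present is to look for their executables on `PATH`. The sketch below assumes the usual Unix executable names `python3`, `tesseract`, and `gs` (Ghostscript). Other platforms may use different names, and `check_deps` is a hypothetical helper.

```shell
#!/bin/sh
# check_deps (hypothetical helper): report which of the given programs are
# on PATH. Returns 1 if at least one is missing.
check_deps() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Executable names are assumptions for a typical Linux or macOS install.
check_deps python3 tesseract gs || echo "Install the missing tools before running ocrmypdf."
```

This only checks for presence. Running `tesseract --version` shows the installed version, which can then be compared against the 4.1.1 minimum.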

How to Use PDF2TXT-OCR

Install: Make sure PDF2TXT-OCR is installed on your system.
Basic Command:
```
ocrmypdf input.pdf output.pdf
```
Optional Features:
- Straighten pages: --deskew
- Fix page orientation: --rotate-pages
- Set languages: -l eng+fra
- PDF/A output: --output-type pdfa

Examples:

Convert an image to a searchable PDF:
```
ocrmypdf input.jpg output.pdf
```

Add OCR to a multilingual PDF:

```
ocrmypdf -l eng+spa input.pdf output.pdf
```

Help: For all options, run:
```
ocrmypdf --help
```

Media and Reviews

PDF2TXT-OCR has been featured in leading publications, including:

Medium: Going Paperless with OCRmyPDF
Linux Links: Excellent Utilities: OCRmyPDF
c’t Magazine: Detailed overview in Germany’s top IT magazine.

Business Inquiries

For feature development, consulting, or integrating PDF2TXT-OCR into larger systems, please get in touch. Support from companies and users helps improve the project.

License

PDF2TXT-OCR is licensed under the Mozilla Public License 2.0. Other components may use different licenses, as detailed in the source code.

Disclaimer

This software is provided "AS IS" without warranties of any kind, either express or implied.

For further details, visit our documentation or report issues on GitHub.
