ENGINEER-MUHAMMAD-SHAHZAIB/PDF2TXT-OCR

PDF2TXT-OCR: Add OCR Text Layer to PDFs

PDF2TXT-OCR enhances scanned PDF files by adding an OCR (Optical Character Recognition) text layer, making them searchable and editable.

Features at a Glance

  • Converts PDFs into searchable PDF/A files.
  • Accurately places OCR text beneath images for easy copy-pasting.
  • Maintains the original image resolution while optimizing file size.
  • Deskews and cleans images for improved OCR accuracy (if requested).
  • Validates both input and output files for compatibility.
  • Distributes tasks across all CPU cores for faster processing.
  • Supports Tesseract OCR for over 100 languages.
  • Handles files with thousands of pages efficiently.
  • Built for privacy, ensuring your data remains secure.

Why Choose PDF2TXT-OCR?

A survey of existing tools turned up recurring problems: misplaced OCR text, altered image resolution, oversized output files, and no PDF/A support. PDF2TXT-OCR was created to address these gaps.


Installation Guide

PDF2TXT-OCR is compatible with Linux, Windows, macOS, and FreeBSD. Docker images for x64 and ARM are also available.

Quick Installation

Platform            Command
Debian/Ubuntu       apt install ocrmypdf
Fedora              dnf install ocrmypdf
macOS (Homebrew)    brew install ocrmypdf
macOS (MacPorts)    port install ocrmypdf
FreeBSD             pkg install py-ocrmypdf
Snap Package        snap install ocrmypdf

For other operating systems, refer to the detailed installation documentation.


Getting Started

PDF2TXT-OCR is a command-line tool. Here’s a quick example:

ocrmypdf input_scanned.pdf output_searchable.pdf

Key Options:

  • -l eng+fra - Specify languages for OCR.
  • --deskew - Straightens crooked pages.
  • --rotate-pages - Detects pages with the wrong orientation and rotates them upright.
  • --output-type pdfa - Produces PDF/A output (this is the default output type).

For a full list of options, use:

ocrmypdf --help
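
When the tool is called from a script rather than typed by hand, the options above can be assembled into an argument list. The following is a minimal Python sketch, not part of PDF2TXT-OCR itself: the helper name build_command and its parameters are hypothetical, and it only emits the flags listed under Key Options.

```python
from typing import List, Optional, Sequence


def build_command(
    input_path: str,
    output_path: str,
    languages: Optional[Sequence[str]] = None,
    deskew: bool = False,
    rotate_pages: bool = False,
    output_type: Optional[str] = None,
) -> List[str]:
    """Assemble an ocrmypdf argument list (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if languages:
        # Multiple languages are joined with "+", e.g. eng+fra.
        cmd += ["-l", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")
    if rotate_pages:
        cmd.append("--rotate-pages")
    if output_type:
        cmd += ["--output-type", output_type]
    # Input and output paths always come last.
    cmd += [input_path, output_path]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even on a machine where the tool is not installed.
    print(" ".join(build_command(
        "input_scanned.pdf",
        "output_searchable.pdf",
        languages=["eng", "fra"],
        deskew=True,
    )))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the tool is installed. Keeping the arguments as a list, rather than a single string, avoids shell quoting problems with file names that contain spaces.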

Multilingual OCR

To use OCR for different languages, install Tesseract language packs. For instance:

# Debian/Ubuntu
apt-get install tesseract-ocr-chi-sim  # Install Chinese (Simplified)

# macOS (Homebrew)
brew install tesseract-lang

Specify multiple languages by joining their Tesseract language codes (mostly three-letter ISO 639 codes such as eng or fra) with +, e.g., -l eng+fra.
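
Before starting a long OCR run, it can help to confirm that every requested language pack is installed. The sketch below is one possible way to do that in Python. It assumes that tesseract --list-langs prints a header line beginning with "List of" followed by one language code per line; the function names are hypothetical and not part of PDF2TXT-OCR.

```python
import shutil
import subprocess
from typing import List, Set


def parse_list_langs(output: str) -> Set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumes a header line ("List of available languages ...")
    followed by one language code per line.
    """
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_languages(spec: str, installed: Set[str]) -> List[str]:
    """Return the codes in a spec such as "eng+fra" that are not installed."""
    return [code for code in spec.split("+") if code and code not in installed]


def installed_languages() -> Set[str]:
    """Ask the local Tesseract for its language packs (empty set if absent)."""
    if shutil.which("tesseract") is None:
        return set()
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Assumption: some builds write the list to stderr, so both streams
    # are parsed. Any warning lines on stderr would be picked up as well.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    missing = missing_languages("eng+fra", installed_languages())
    print("Missing language packs:", ", ".join(missing) or "none")
```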


Advanced Features

  • Convert an image into a searchable PDF:
    ocrmypdf input.jpg output.pdf
  • Add OCR to an existing PDF:
    ocrmypdf input.pdf output.pdf
  • Add OCR to a PDF that contains more than one language:
    ocrmypdf -l eng+spa input.pdf output.pdf

Check out the full documentation for additional examples.


Requirements

  • Python (supported versions are listed on PyPI).
  • Tesseract OCR (v4.1.1 or higher).
  • Ghostscript for PDF processing.
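
The requirements above can be checked from a script before installation. The following Python sketch is illustrative only: the helper names are hypothetical, the executable names (tesseract, gs, and gswin64c on Windows) are assumptions about a typical setup, and the version parser assumes that tesseract --version prints a line of the form "tesseract 4.1.1".

```python
import re
import shutil
from typing import Dict, Optional, Tuple

MIN_TESSERACT = (4, 1, 1)  # minimum version from the Requirements list


def parse_tesseract_version(output: str) -> Optional[Tuple[int, ...]]:
    """Pull a numeric version out of `tesseract --version` output."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version: Tuple[int, ...],
                  minimum: Tuple[int, ...] = MIN_TESSERACT) -> bool:
    """Compare versions, padding short ones so (5,) is treated as (5, 0, 0)."""
    padded = tuple(version) + (0,) * (len(minimum) - len(version))
    return padded >= minimum


def find_tools() -> Dict[str, Optional[str]]:
    """Locate the required executables on PATH (None when missing)."""
    return {
        "tesseract": shutil.which("tesseract"),
        # Assumption: Ghostscript is `gs` on Unix-like systems and
        # `gswin64c` on 64-bit Windows installs.
        "ghostscript": shutil.which("gs") or shutil.which("gswin64c"),
    }


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```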

How to Use PDF2TXT-OCR

  1. Install: Make sure PDF2TXT-OCR is installed on your system.

  2. Basic Command:

    ocrmypdf input.pdf output.pdf
  3. Optional Features:

    • Straighten pages: --deskew
    • Fix page orientation: --rotate-pages
    • Set languages: -l eng+fra
    • PDF/A output: --output-type pdfa
  4. Examples:

    • Convert image to PDF:
      ocrmypdf input.jpg output.pdf
    • Add OCR to multilingual PDF:
      ocrmypdf -l eng+spa input.pdf output.pdf
  5. Help: For all options, run:

    ocrmypdf --help

Media and Reviews

PDF2TXT-OCR has been featured in leading publications, including:


Business Inquiries

For feature development, consulting, or integrating PDF2TXT-OCR into larger systems, please get in touch. Support from companies and users helps improve the project.


License

PDF2TXT-OCR is licensed under the Mozilla Public License 2.0. Other components may use different licenses, as detailed in the source code.


Disclaimer

This software is provided "AS IS" without warranties of any kind, either express or implied.

For further details, visit our documentation or report issues on GitHub.

