Bug: Searchable pdf comes out improperly rotated even though the images look fine [Exif rotation metadata problem] #16

@ElectricRCAircraftGuy

Scenario:

I take some photos of documents with my phone and download them. They are properly rotated. I cd into their directory and run `pdf2searchablepdf .`, which produces the file `._searchable.pdf`.

The PDF pages are improperly rotated though!

Double-clicking an image in Ubuntu opens it in the Ubuntu Image Viewer, which shows it rotated properly, so what's wrong?

Well, it turns out the image contains "Exif orientation metadata" which tesseract is apparently ignoring! Open the image in GIMP and it will show the following:

This image contains Exif orientation metadata. Would you like to rotate the image?
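For context, the Exif Orientation tag stores a number from 1 to 8 that tells a viewer how to rotate or flip the stored pixels before display. A program that ignores the tag sees only the raw pixels, which would explain the mismatch described above. Below is a minimal sketch for decoding that number. The function name `describe_orientation` and the file name `photo.jpg` are hypothetical, and reading the tag with exiftool is an assumption about available tooling, not something pdf2searchablepdf does.

```shell
#!/bin/sh
# Hypothetical helper: map an Exif Orientation value (1-8) to the transform
# a viewer applies before displaying the stored pixels.
describe_orientation() {
    case "$1" in
        1) echo "normal (no transform needed)" ;;
        2) echo "mirror horizontal" ;;
        3) echo "rotate 180" ;;
        4) echo "mirror vertical" ;;
        5) echo "mirror horizontal, then rotate 270 CW" ;;
        6) echo "rotate 90 CW" ;;
        7) echo "mirror horizontal, then rotate 90 CW" ;;
        8) echo "rotate 270 CW" ;;
        *) echo "unknown or missing orientation"; return 1 ;;
    esac
}

# Usage, assuming exiftool is installed
# (-n prints the raw number, -s3 prints only the value):
#   describe_orientation "$(exiftool -s3 -n -Orientation photo.jpg)"
```

Any value other than 1 means the stored pixels differ from what an Exif-aware viewer displays.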

So:

  1. Report this as a bug to tesseract.
  2. In the meantime, work around it by forcing a true rotation before running tesseract:
    sudo apt install exiftran
    cd path/to/dir_of_images
    exiftran -ai *.jpg
    See my answer here: https://superuser.com/a/1645862/425838.
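The workaround in item 2 could be wrapped into a single step, sketched below. The function name `rotate_then_ocr` and the `DRY_RUN` variable are hypothetical and not part of pdf2searchablepdf. The sketch assumes exiftran and pdf2searchablepdf are on the PATH and that the images end in `.jpg`.

```shell
#!/bin/sh
# Hypothetical helper: bake the Exif orientation into the pixels, then OCR.
# Set DRY_RUN=1 to print the commands instead of running them.
# The body runs in a subshell, so its variables and "exit" stay contained.
rotate_then_ocr() (
    dir="${1:-.}"
    run=""
    if [ -n "${DRY_RUN:-}" ]; then
        run="echo"
    fi

    # Collect the JPEGs; an unmatched glob stays literal, so test that it exists.
    set -- "$dir"/*.jpg
    if [ ! -e "$1" ]; then
        echo "no .jpg files found in $dir" >&2
        exit 1
    fi

    # -a: rotate according to the Exif orientation tag; -i: edit files in place.
    $run exiftran -ai "$@" || exit 1

    # The pixels are now upright, so OCR the directory as usual.
    $run pdf2searchablepdf "$dir"
)
```

For example, `DRY_RUN=1 rotate_then_ocr path/to/dir_of_images` previews the commands. Because `exiftran -i` overwrites the originals, it may be worth keeping a backup copy of the images first.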

I should also auto-enhance (whiten) the images using the two Python algorithms from my answer here: https://stackoverflow.com/questions/48268068/how-do-i-do-the-equivalent-of-gimps-colors-auto-white-balance-in-python-fu/67343271#67343271. See also: https://superuser.com/questions/370920/auto-image-enhance-for-ubuntu.

And I should compress them with jpegoptim as I explain in my readme here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF#image-size-notes.

Labels: bug (Something isn't working)