pdf-tucano

A standalone PDF-to-Markdown microservice for converting PDF documents into Markdown. Upload a PDF, poll for status, and retrieve Markdown output once conversion completes. Jobs are persisted in PostgreSQL and PDFs are stored on disk, while conversion relies on OpenRouter for OCR/LLM transcription.

Features

Asynchronous job queue with status and result endpoints
Conversion workflow: pdf2image → PNG → OpenRouter → Markdown
Configurable DPI, scaling width, and concurrency via environment variables
PostgreSQL persistence with per-page tracking and filesystem PDF storage
Lightweight admin UI at /admin for monitoring job status and results
Container image (pdf-tucano) ready for deployment

Quickstart

Set environment variables

Copy .env.example to .env and fill in your OpenRouter key, or export values manually:

OPENROUTER_API_KEY=your-openrouter-api-key

Run with Docker Compose (example)

services:
  postgres:
    image: postgres:17
    restart: unless-stopped
    environment:
      POSTGRES_DB: pdf_tucano
      POSTGRES_USER: postgresuser
      POSTGRES_PASSWORD: postgresPASS123
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgresuser -d pdf_tucano"]
      interval: 5s
      timeout: 5s
      retries: 5
    volumes:
      - pgdata:/var/lib/postgresql/data

  pdf-tucano:
    build: .
    restart: unless-stopped    
    environment:
      DATABASE_URL: postgresql+psycopg2://postgresuser:postgresPASS123@postgres:5432/pdf_tucano
      OPENROUTER_MODEL: google/gemini-2.0-flash-001
      PAGE_IMAGE_WIDTH: 1500
      PAGE_IMAGE_DPI: 300
      MAX_CONCURRENT_PAGES: 8
      OPENROUTER_API_KEY: ${OPENROUTER_API_KEY}
      LOG_LEVEL: DEBUG
    depends_on:
      postgres:
        condition: service_healthy
    ports:
      - "8000:8000"
    volumes:
      - pdfdata:/data

volumes:
  pgdata:
  pdfdata:

Note: The PostgreSQL service is only reachable from other containers on the Compose network (e.g., via hostname postgres). If you previously created the pgdata volume with a different Postgres major version, remove it before starting (docker volume rm pdf-tucano_pgdata) or Docker will report an incompatible data-directory error.

API usage

# Submit a PDF (async)
curl -F "file=@document.pdf" http://localhost:8000/jobs

# Poll status
curl http://localhost:8000/jobs/<job_id>

# Retrieve markdown when complete
curl http://localhost:8000/jobs/<job_id>/result.txt

# Optional: inspect jobs in the browser
# Visit http://localhost:8000/admin

Configuration

Variable	Default	Description
`DATABASE_URL`	–	SQLAlchemy connection string for PostgreSQL
`OPENROUTER_API_KEY`	–	API key for OpenRouter
`OPENROUTER_MODEL`	`google/gemini-2.0-flash-001`	Model used for image-to-markdown conversion. This default model works great and it's cheap.
`PAGE_IMAGE_WIDTH`	`1500`	Target image width before sending to OpenRouter
`PAGE_IMAGE_DPI`	`300`	PDF rasterization DPI
`MAX_CONCURRENT_PAGES`	`8`	Page-level concurrency per job
`JOB_POLL_INTERVAL_SECONDS`	`2.0`	Delay between jobs once processing completes
`JOB_IDLE_SLEEP_SECONDS`	`5.0`	Sleep duration when no queued jobs are found
`STORAGE_ROOT`	`/data`	Root directory for storing uploaded PDFs
`PDF_SUBDIR`	`pdfs`	Subdirectory (under storage root) for PDFs

The included docker-compose.yml mirrors these processing defaults and enables debug logging (LOG_LEVEL=DEBUG).

Development

pip install -r requirements.txt
uvicorn pdf_tucano.app:app --reload

The worker starts automatically with the FastAPI application.

Notes

Poppler must be installed (handled in the Docker image) for pdf2image.
To reset all data, clear the jobs and pages tables and remove stored PDFs from the volume.
The service returns HTTP 409 if you request results before a job is complete.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
pdf_tucano		pdf_tucano
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pdf-tucano

Features

Quickstart

Configuration

Development

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pdf-tucano

Features

Quickstart

Configuration

Development

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages