pdf-extraction

Here are 307 public repositories matching this topic...

opendataloader-project / opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

Updated Jun 19, 2026
Java

kreuzberg-dev / kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or usable via CLI, REST API, or MCP server.

Updated Jun 21, 2026
Rust

Zipstack / unstract

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

ocr data-engineering idp ai-agents structured-output pdf-extraction document-ai llm prompt-engineering generative-ai mcp-server json-extraction

Updated Jun 21, 2026
Python

firecrawl / pdf-inspector

Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.

nodejs python markdown rust pdf text-extraction pdf-parser pdf-extraction ocr-routing pdf-classification

Updated Jun 21, 2026
Rust

24eme / signaturepdf

Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs

php pdf js signature pdf-manipulation pdf-merge pdf-format pdf-rotate pdf-merger pdf-meta-editor pdf-tools pdf-signature pdf-compression pdf-editor pdf-sign pdf-extraction pdf-signer pdf-metadata pdf-compressor

Updated Jun 20, 2026
JavaScript

pytr-org / pytr

Use TradeRepublic in terminal and mass download all documents

portfolio finance terminal-app portfolio-performance pdf-extraction traderepublic-statements traderepublic

Updated Jun 17, 2026
Python

ArtifexSoftware / mupdf.js

JavaScript bindings for MuPDF

javascript pdf typescript wasm mupdf pdf-viewer pdf-extraction

Updated May 5, 2026

ExtractPDF4J / ExtractPDF4J

Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.

java cli ocr maven pdf-document pdf-extractor ocr-recognition document-processing pdf-processor pdf-document-processor pdf-extraction java17

Updated Mar 15, 2026
Java

aiptimizer / TurboOCR

Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.

ocr grpc nvidia text-recognition text-detection inference-server fp16 tensorrt rag fastapi pdf-extraction paddleocr easyocr document-ai document-parsing qwen-vl gpu-ocr

Updated Jun 11, 2026
C++

mateogon / pdf-narrator

Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

pdf text-to-speech audiobook tts epub low-resource pdf-extraction pdf-to-audiobook immersive-reading kokoro-tts audiobook-generator pdf-audiobook

Updated Feb 26, 2026
Python

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

appautomaton / document-SKILLs

Claude Code and Codex SKILLs for PDF, Excel, Word, and PowerPoint manipulation — extraction, forms, formulas, tracked changes, adapted from Anthropic skills.

excel docx pptx codex ai-agents document-processing pdf-extraction agent-skills claude-code claude-skills

Updated Jun 18, 2026
Python

NameetP / pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

python pdf ocr mcp self-healing structured-extraction rag pdf-to-json pdf-extraction ai-agent llm document-parsing pdf-to-markdown docling opendataloader

Updated Jun 10, 2026
Python

jztan / pdf-mcp

An MCP server that lets Claude Code and other AI agents work through large PDFs without overflowing their context — search by meaning or keyword, read only the pages that matter, and cleanly pull out tables, images, and scanned text, even from multi-column and Japanese layouts.

python pdf ocr ai mcp opencode cjk copilot semantic-search table-extraction claude document-processing pymupdf pdf-extraction llm agentic-ai model-context-protocol mcp-server codex-cli

Updated Jun 20, 2026
Python

heleninsights-dot / phd-deepread-workflow

A professional CLI workflow for PhD students to extract, analyze, and visualize academic papers into structured Markdown and Obsidian Canvas.

python pdf workflow research academic obsidian literature-review pdf-extraction

Updated May 26, 2026
Python

pcschreiber1 / PDF_Extraction-Translation

Translate many large PDF reports for free using Python.

python pdf-extraction pdf-translation

Updated Dec 31, 2022
Jupyter Notebook

YounesBensafia / arxiv-reader-mcp

Want to search arXiv papers, fetch metadata, and extract full-text PDFs without leaving your editor? This MCP server connects any MCP-compatible client (Claude Code, etc.) directly to arXiv.

python ai mcp arxiv-api research-papers pdf-extraction arxiv-papers model-context-protocol mcp-server

Updated Jun 7, 2026
Python

wszqkzqk / qt-web-extractor

Web content extraction engine backed by Qt WebEngine.

mcp chromium web-scraping qtwebengine content-extraction headless-browser pdf-extraction pyside6 open-webui mcp-server

Updated May 19, 2026
Python

clark-labs-inc / pdfsink-rs

Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.

rust pdf text-extraction rust-library pdf-to-text rust-crate table-extraction pdf-parser document-processing layout-analysis pdf-to-json pdf-extraction pdfplumber document-ai clark-labs

Updated Jun 6, 2026
Rust

retsef / rpdfium

Ruby binding for Pdfium

ruby pdf pdf-parser pdfium pdf-extraction document-parsing

Updated Jun 18, 2026
Ruby
