PDF Text Extraction

Discussions center on tools, techniques, and challenges for extracting text and data from PDFs, including OCR for scanned documents, handling complex layouts and tables, and using LLMs like GPT-4o.

➡️ Stable 1.1x AI & Machine Learning

2,335

Comments

Years Active

Top Authors

#9217

Topic ID

Activity Over Time

2008

2009

2010

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

227

2021

102

2022

188

2023

340

2024

377

2025

558

2026

Top Contributors

simonw (18) throwaway4496 (10) constantinum (10) dredmorbius (8) chiccomagnus (8)

Keywords

LLM pdfdata.io MM EDIT xpdfreader.com PDF.js SDK YYYY man.html JSON pdfs pdf text documents scanned extract ha ha ocr ha ha ha file

Sample Comments

giblet • Aug 28, 2017 • View on HN

Would this help " rel="nofollow">https://www.pdfdata.io/>?

kristofferR • Nov 18, 2023 • View on HN

I recommend running any such PDFs through OCRmyPDF.https://github.com/ocrmypdf/OCRmyPDF

Zardoz84 • Mar 1, 2025 • View on HN

I was expecting a tool to extract text from PDFs, not another LLM pretending to be a reliable OCR.

knicholes • Feb 16, 2025 • View on HN

I'm biased as an employee, but who knows PDFs better than Adobe? Use their PDF text extraction API.

bgia • Aug 27, 2017 • View on HN

Extracting data from PDF in a reliable way.

programmarchy • Oct 23, 2025 • View on HN

PDF is arguably a confusing format for LLMs to read.

vikasr111 • Jul 17, 2023 • View on HN

Does it work well with scanned PDF? In my experiments it was not giving the correct output.

UnFleshedOne • May 17, 2019 • View on HN

PDF is a lossy format, slightly better than an image, it can be really hard to extract legible data from it.

giantg2 • Sep 18, 2023 • View on HN

PDFs are not necessarily text documents. That's why it's important for the system to specify if it's expecting text, or use OCR.

richardmeng • Aug 12, 2024 • View on HN

Today's large vision models like GPT-4o can parse the content heavy papers pretty well (and respect their structures).Yah basically it allows you to send PDFs as image patches into GPT-4o model that workflow can be easily built.Feel free to send me an email [email protected], happy to evaluate your case and try to save that 200K :p