PDF Text Extraction

Discussions center on tools, techniques, and challenges for extracting text and data from PDFs, including OCR for scanned documents, handling complex layouts and tables, and using LLMs like GPT-4o.

➡️ Stable 1.1x AI & Machine Learning
2,335
Comments
19
Years Active
5
Top Authors
#9217
Topic ID

Activity Over Time

2008
4
2009
20
2010
26
2011
29
2012
25
2013
47
2014
44
2015
38
2016
50
2017
96
2018
62
2019
77
2020
227
2021
102
2022
188
2023
340
2024
377
2025
558
2026
25

Keywords

LLM pdfdata.io MM EDIT xpdfreader.com PDF.js SDK YYYY man.html JSON pdfs pdf text documents scanned extract ha ha ocr ha ha ha file

Sample Comments

giblet Aug 28, 2017 View on HN

Would this help " rel="nofollow">https://www.pdfdata.io/>?

kristofferR Nov 18, 2023 View on HN

I recommend running any such PDFs through OCRmyPDF.https://github.com/ocrmypdf/OCRmyPDF

Zardoz84 Mar 1, 2025 View on HN

I was expecting a tool to extract text from PDFs, not another LLM pretending to be a reliable OCR.

knicholes Feb 16, 2025 View on HN

I'm biased as an employee, but who knows PDFs better than Adobe? Use their PDF text extraction API.

bgia Aug 27, 2017 View on HN

Extracting data from PDF in a reliable way.

programmarchy Oct 23, 2025 View on HN

PDF is arguably a confusing format for LLMs to read.

vikasr111 Jul 17, 2023 View on HN

Does it work well with scanned PDF? In my experiments it was not giving the correct output.

UnFleshedOne May 17, 2019 View on HN

PDF is a lossy format, slightly better than an image, it can be really hard to extract legible data from it.

giantg2 Sep 18, 2023 View on HN

PDFs are not necessarily text documents. That's why it's important for the system to specify if it's expecting text, or use OCR.

richardmeng Aug 12, 2024 View on HN

Today's large vision models like GPT-4o can parse the content heavy papers pretty well (and respect their structures).Yah basically it allows you to send PDFs as image patches into GPT-4o model that workflow can be easily built.Feel free to send me an email [email protected], happy to evaluate your case and try to save that 200K :p