PDF Text Extraction
Discussions center on tools, techniques, and challenges for extracting text and data from PDFs, including OCR for scanned documents, handling complex layouts and tables, and using LLMs like GPT-4o.
Activity Over Time
Top Contributors
Keywords
Sample Comments
Would this help " rel="nofollow">https://www.pdfdata.io/>?
I recommend running any such PDFs through OCRmyPDF.https://github.com/ocrmypdf/OCRmyPDF
I was expecting a tool to extract text from PDFs, not another LLM pretending to be a reliable OCR.
I'm biased as an employee, but who knows PDFs better than Adobe? Use their PDF text extraction API.
Extracting data from PDF in a reliable way.
PDF is arguably a confusing format for LLMs to read.
Does it work well with scanned PDF? In my experiments it was not giving the correct output.
PDF is a lossy format, slightly better than an image, it can be really hard to extract legible data from it.
PDFs are not necessarily text documents. That's why it's important for the system to specify if it's expecting text, or use OCR.
Today's large vision models like GPT-4o can parse the content heavy papers pretty well (and respect their structures).Yah basically it allows you to send PDFs as image patches into GPT-4o model that workflow can be easily built.Feel free to send me an email [email protected], happy to evaluate your case and try to save that 200K :p