PDF Table Extraction

This cluster covers tools and services such as AWS Textract and Azure Document Intelligence, along with LLMs, for extracting structured data (tables, key-value pairs) from semi-structured documents including PDFs, invoices, receipts, and forms.

➡️ Stable 1.7x AI & Machine Learning

Comments: 1,610

Topic ID: #5411

Activity Over Time (comments per year, chart spanning 2008 to 2026; only labeled values shown)

2020: 109 · 2021: 102 · 2023: 193 · 2024: 367 · 2025: 419

Top Contributors

constantinum (32) superdocs1 (30) visarga (18) infecto (15) simonw (14)

Keywords

e.g LLM AWS CSV extracttable.com AEC AI HN RAG konfuzio.com documents table extract tables pdf document pdfs data extraction llm

Sample Comments

staticautomatic • Oct 24, 2017

Extraction from semi-structured documents. Reqs:
- No absolute element positions
- Able to handle a reasonable amount of skew
- Able to handle photographs of documents (e.g. mobile)
- Excellent support for repeating groups/elements
- Around 95% accuracy without human verification

haolez • Jan 3, 2024

Looks nice! Do you know if they can do table structuring as well? Something similar to what Amazon Textract does [0].

[0] https://docs.aws.amazon.com/textract/latest/dg/how-it-works-...

visarga • May 15, 2023

Not just PDFs with tables. It works on any semi-structured document with key-value pairs like invoices, purchase orders, receipts, tickets, forms, error messages, logs, etc.

The "Information Extraction from semistructured and unstructured documents" task is seeing a huge leap, just 3 years ago it was very tedious to train a model to solve a single use case. Now they all work.

But if you do make the effort to train a specialised model for a single document type, the narrow model su

Oras • Apr 13, 2024

Amazon Textract, to get tables, format them with Python as csv then send to your preferred AI model.
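The step described above (Textract tables, formatted as CSV with Python) might look roughly like the sketch below. It assumes the response has the block layout Textract returns from `analyze_document` with `FeatureTypes=["TABLES"]` (TABLE blocks pointing at CELL blocks, which point at WORD blocks). The function names are made up for illustration, and merged cells and checkboxes are ignored.

```python
import csv
import io

# Assumed to come from something like:
#   boto3.client("textract").analyze_document(
#       Document={"Bytes": image_bytes}, FeatureTypes=["TABLES"])
# The helpers below only touch the returned dict, so they run without AWS.


def textract_tables_to_rows(response):
    """Return one list of rows (lists of cell strings) per TABLE block."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        # Follow only CHILD relationships; other relationship types are skipped.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [blocks[i]["Text"] for i in child_ids(cell)
                     if blocks[i]["BlockType"] == "WORD"]
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def rows_to_csv(rows):
    """Serialize one table as CSV text, ready to paste into a model prompt."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Each CSV string can then be sent to whichever model you prefer; that call is left out because it depends on the provider.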

infecto • Jun 7, 2024

For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly but the accuracy is much higher than what you will find using the OpenAI API.

Closi • Dec 25, 2023

Have you tried Azure AI Document Intelligence? In theory it's exactly this...

infecto • Nov 19, 2024

Have not used it on your docs but I can say that it definitely works well with forms and forms with tables like a Bill of Lading. It costs extra but you need to turn on table extract (at least in AWS). You can then get a markdown representation of that page, including the table. You can of course pull out the table itself, but unless it's standardized you will need the middleman LLM figuring out the exact data/structure you are looking for.
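As a rough illustration of the "markdown representation, then a middleman LLM" idea, the sketch below renders already-extracted table rows as a markdown table and wraps them in a prompt. Both helpers are hypothetical and built by hand here; they are not part of any AWS API, and the prompt wording is only an assumption about what such a request could look like.

```python
def rows_to_markdown(rows):
    """Render rows (first row treated as the header) as a markdown table."""
    if not rows:
        return ""

    def fmt(row):
        # Escape pipes so cell text cannot break the table layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in row) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(r) for r in body]
    return "\n".join(lines)


def build_extraction_prompt(markdown_table, wanted_fields):
    """Hypothetical prompt asking an LLM to map a non-standard table to fields."""
    return (
        "The table below was extracted from a scanned form.\n"
        "Return a JSON object with these keys: "
        + ", ".join(wanted_fields)
        + ".\n\n"
        + markdown_table
    )
```

For example, `rows_to_markdown([["Item", "Qty"], ["Widget", "2"]])` gives a three-line table with a `---` separator row, which most chat models read without extra instructions.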

dontlikeyoueith • Mar 6, 2025

AWS Textract works pretty well for this and is much cheaper than running LLMs.

humansareok1 • Feb 23, 2024

Give an LLM the text of a PDF document. Ask the model to extract values in the document or in tables. Input the values into a spreadsheet. This is at minimum a task which costs companies around the world hundreds of millions of dollars a year.
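The three steps in that comment (give the model the text, ask for values, put them in a spreadsheet) could be sketched as follows. The model call is passed in as `call_llm`, a hypothetical function taking a prompt string and returning the model's reply, so no particular provider or SDK is assumed. The prompt wording and JSON reply format are assumptions too.

```python
import csv
import io
import json


def extract_to_csv(pdf_text, fields, call_llm):
    """Ask an LLM for the given fields and return a one-row CSV string.

    call_llm is a placeholder: any callable taking a prompt string and
    returning the model's reply as a string.
    """
    prompt = (
        "Extract the following fields from the document and reply with a "
        "single JSON object using exactly these keys: "
        + ", ".join(fields)
        + ". Use null for anything that is missing.\n\n"
        + pdf_text
    )
    data = json.loads(call_llm(prompt))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object from the model")

    # Keep only the requested fields; missing ones become empty cells.
    row = {field: data.get(field) for field in fields}
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()
```

In practice the reply may not be valid JSON, so a real pipeline would likely add retries or a validation step before anything reaches the spreadsheet.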

jmartin2683 • Feb 8, 2025

We use Claude 3.5 Sonnet to OCR and structure tabular data from PDFs and it’s virtually flawless… orders of magnitude better than Textract (or pretty much any other LLM).