PDF Table Extraction

This cluster focuses on tools and services, such as AWS Textract, Azure Document Intelligence, and LLMs, for extracting structured data (tables and key-value pairs) from semi-structured documents including PDFs, invoices, receipts, and forms.

➡️ Stable 1.7x AI & Machine Learning
Comments: 1,610
Years Active: 19
Top Authors: 5
Topic ID: #5411

Activity Over Time

2008: 2
2009: 2
2010: 11
2011: 10
2012: 13
2013: 18
2014: 26
2015: 25
2016: 31
2017: 56
2018: 70
2019: 68
2020: 109
2021: 102
2022: 69
2023: 193
2024: 367
2025: 419
2026: 19

Keywords

e.g LLM AWS CSV extracttable.com AEC AI HN RAG konfuzio.com documents table extract tables pdf document pdfs data extraction llm

Sample Comments

staticautomatic Oct 24, 2017

Extraction from semi-structured documents. Reqs:
- No absolute element positions
- Able to handle a reasonable amount of skew
- Able to handle photographs of documents (e.g. mobile)
- Excellent support for repeating groups/elements
- Around 95% accuracy without human verification

haolez Jan 3, 2024

Looks nice! Do you know if they can do table structuring as well? Something similar to what Amazon Textract does [0].

[0] https://docs.aws.amazon.com/textract/latest/dg/how-it-works-...

visarga May 15, 2023

Not just PDFs with tables. It works on any semi-structured document with key-value pairs like invoices, purchase orders, receipts, tickets, forms, error messages, logs, etc.

The "Information Extraction from semistructured and unstructured documents" task is seeing a huge leap; just 3 years ago it was very tedious to train a model to solve a single use case. Now they all work.

But if you do make the effort to train a specialised model for a single document type, the narrow model su

Oras Apr 13, 2024

Use Amazon Textract to get tables, format them as CSV with Python, then send them to your preferred AI model.
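The "format them with Python as csv" step in the comment above could look roughly like the sketch below. It assumes the input is the `Blocks` list from a Textract `AnalyzeDocument` response requested with the `TABLES` feature, where `TABLE` blocks point to `CELL` blocks and `CELL` blocks point to `WORD` blocks through `CHILD` relationships. The function name `tables_to_csv` is hypothetical, and merged cells and selection elements (checkboxes) are ignored for brevity.

```python
import csv
import io


def tables_to_csv(blocks):
    """Return one CSV string per TABLE block in a Textract-style block list."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Yield the Ids of all CHILD relationships; cells may have none.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    def cell_text(cell):
        # Join the WORD children of a cell in the order they are listed.
        words = [by_id[i]["Text"] for i in child_ids(cell)
                 if by_id[i]["BlockType"] == "WORD"]
        return " ".join(words)

    results = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = [by_id[i] for i in child_ids(table)
                 if by_id[i]["BlockType"] == "CELL"]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # RowIndex and ColumnIndex are 1-based.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(grid)
        results.append(buf.getvalue())
    return results
```

Each returned string can then be pasted into the prompt for whichever model is used downstream; that part is left out here because it depends on the model's API.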

infecto Jun 7, 2024

For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly but the accuracy is much higher than what you will find using an OpenAI API.

Closi Dec 25, 2023

Have you tried Azure AI Document Intelligence? In theory it's exactly this...

infecto Nov 19, 2024

Have not used it on your docs, but I can say that it definitely works well with forms and forms with tables, like a Bill of Lading. It costs extra, but you need to turn on table extract (at least in AWS). You can then get a markdown representation of that page, including the table. You can of course pull out the table itself, but unless it's standardized you will need the middleman LLM figuring out the exact data/structure you are looking for.
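For the "turn on table extract" point in AWS, one plausible reading is passing `TABLES` in the `FeatureTypes` list of Textract's `AnalyzeDocument` call. The sketch below assumes a client object exposing boto3's `analyze_document` method; the helper name `extract_table_blocks` is hypothetical. It returns the raw `TABLE` blocks only and does not produce the markdown representation the comment mentions.

```python
def extract_table_blocks(client, document_bytes):
    """Call AnalyzeDocument with table analysis enabled; return TABLE blocks."""
    response = client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES"],  # table analysis is opt-in per request
    )
    return [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
```

With real credentials, `client` would typically be created with `boto3.client("textract")` and `document_bytes` read from an image or single-page file. Passing the client in as a parameter keeps the function easy to exercise with a stub.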

AWS Textract works pretty well for this and is much cheaper than running LLMs.

humansareok1 Feb 23, 2024

Give an LLM the text of a PDF document. Ask the model to extract values in the document or in tables. Input the values into a spreadsheet. This is at minimum a task which costs companies around the world hundreds of millions of dollars a year.
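The three steps in the comment above (give the model the text, ask for values, put them in a spreadsheet) might be sketched as follows. Everything named here is an assumption for illustration: `FIELDS` is a made-up field list, and `ask_model` is a placeholder for whatever callable sends a prompt to an LLM and returns its reply as a string. The sketch also assumes the model replies with bare JSON; `json.loads` raises if it does not.

```python
import csv
import io
import json

# Hypothetical field list; a real one depends on the document type.
FIELDS = ["invoice_number", "date", "total"]


def build_prompt(document_text, fields):
    """Ask for a single JSON object with exactly the requested keys."""
    return (
        "Extract the following fields from the document and reply with a "
        f"single JSON object whose keys are exactly {json.dumps(fields)}. "
        "Use null for missing values.\n\n"
        f"Document:\n{document_text}"
    )


def extract_row(document_text, fields, ask_model):
    """Send the prompt through ask_model and return values in field order."""
    reply = ask_model(build_prompt(document_text, fields))
    data = json.loads(reply)  # raises ValueError on a non-JSON reply
    return [data.get(f) for f in fields]


def rows_to_csv(fields, rows):
    """Write a header plus one line per document; None becomes an empty cell."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fields)
    writer.writerows(rows)
    return buf.getvalue()
```

A CSV string is used as the spreadsheet stand-in because any spreadsheet tool can open it. In practice the extracted values would still need checking before anyone relies on them.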

jmartin2683 Feb 8, 2025

We use Claude 3.5 Sonnet to OCR and structure tabular data from PDFs and it’s virtually flawless… orders of magnitude better than Textract (or pretty much any other LLM).