PDF Table Extraction
This cluster focuses on tools, services such as AWS Textract and Azure Document Intelligence, and LLMs for extracting structured data (tables and key-value pairs) from semi-structured documents, including PDFs, invoices, receipts, and forms.
Sample Comments
Extraction from semi-structured documents. Reqs:
- No absolute element positions
- Able to handle a reasonable amount of skew
- Able to handle photographs of documents (e.g. mobile)
- Excellent support for repeating groups/elements
- Around 95% accuracy without human verification
Looks nice! Do you know if they can do table structuring as well? Something similar to what Amazon Textract does [0].
[0] https://docs.aws.amazon.com/textract/latest/dg/how-it-works-...
Not just PDFs with tables. It works on any semi-structured document with key-value pairs like invoices, purchase orders, receipts, tickets, forms, error messages, logs, etc. The "Information Extraction from semistructured and unstructured documents" task is seeing a huge leap: just 3 years ago it was very tedious to train a model to solve a single use case. Now they all work. But if you do make the effort to train a specialised model for a single document type, the narrow model su
Use Amazon Textract to get the tables, format them as CSV with Python, then send them to your preferred AI model.
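The "format them with Python as CSV" step in the comment above could look roughly like the sketch below. It assumes the input is a dict shaped like a Textract AnalyzeDocument response requested with the TABLES feature (a `Blocks` list containing `TABLE`, `CELL`, and `WORD` blocks linked by `CHILD` relationships). The function name `textract_tables_to_csv` is made up for illustration, and merged cells and selection elements are ignored.

```python
import csv
import io


def textract_tables_to_csv(response):
    """Return one CSV string per TABLE block in a Textract-style response dict."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        # Yield the Ids listed under CHILD relationships, if any.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    def cell_text(cell):
        # Join the WORD children of a cell with spaces; other child types are skipped.
        words = [blocks[i]["Text"] for i in child_ids(cell)
                 if blocks[i]["BlockType"] == "WORD"]
        return " ".join(words)

    csv_tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = [blocks[i] for i in child_ids(block)
                 if blocks[i]["BlockType"] == "CELL"]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        # RowIndex and ColumnIndex are 1-based, so shift to 0-based grid positions.
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(grid)
        csv_tables.append(buf.getvalue())
    return csv_tables
```

Each returned string can then be pasted into the prompt for whichever model is used downstream.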
For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly, but the accuracy is much higher than what you will get from the OpenAI API.
Have you tried Azure AI Document Intelligence? In theory it's exactly this...
Have not used it on your docs, but I can say that it definitely works well with forms and forms with tables like a Bill of Lading. It costs extra, but you need to turn on table extraction (at least in AWS). You can then get a markdown representation of that page, including the table. You can of course pull out the table itself, but unless it's standardized you will need a middleman LLM to figure out the exact data/structure you are looking for.
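Pulling the table itself out of a markdown page representation, as the comment above describes, can be sketched as follows. This assumes the page arrives as a markdown string with ordinary pipe tables. `parse_markdown_tables` is a hypothetical helper, not part of any vendor SDK, and it does not handle escaped pipes inside cells.

```python
def parse_markdown_tables(markdown):
    """Return a list of tables, each a list of rows, each row a list of cell strings."""
    tables, current = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        is_row = (len(stripped) > 1
                  and stripped.startswith("|")
                  and stripped.endswith("|"))
        if is_row:
            cells = [c.strip() for c in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. |---|:--:|
            if all(c and set(c) <= set("-:") for c in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line ends the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables
```

The rows that come back are still just strings; mapping them onto the fields a particular document needs is the part the comment hands to an LLM.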
AWS Textract works pretty well for this and is much cheaper than running LLMs.
Give an LLM the text of a PDF document. Ask the model to extract values from the document or its tables. Enter the values into a spreadsheet. At a minimum, this is a task that costs companies around the world hundreds of millions of dollars a year.
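The three steps in the comment above (give the model the text, ask for values, put them in a spreadsheet) might be wired together as in this minimal sketch. The `complete` argument is a placeholder for whatever function sends a prompt to a model and returns its reply as a string. No specific vendor API is assumed, and the sketch expects the reply to be bare JSON.

```python
import csv
import io
import json


def build_prompt(document_text, fields):
    """Ask for a single JSON object keyed by the requested field names."""
    return (
        "Extract the following fields from the document and reply with a "
        "single JSON object using exactly these keys: "
        + ", ".join(fields)
        + ". Use null for any value that is not present.\n\n"
        + "Document:\n"
        + document_text
    )


def extract_row(document_text, fields, complete):
    """Call the model via `complete` (hypothetical) and keep only the requested fields."""
    reply = complete(build_prompt(document_text, fields))
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    return {f: data.get(f) for f in fields}


def rows_to_csv(rows, fields):
    """Serialize extracted rows as CSV text; missing (None) values become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping `complete` as a parameter makes it easy to swap models or to test the pipeline with a canned reply. In practice the reply would also need validation against the source document before anyone relies on it.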
We use Claude 3.5 sonnet to OCR and structure tabular data from PDFs and it’s virtually flawless… orders of magnitude better than Textract (or pretty much any other LLM).