A system for agentic LLM-powered data processing and ETL
Updated Jun 24, 2025 - Python
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Generic framework for historical document processing
TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the visual template shared across documents.
A Python framework for multi-modal document understanding with Amazon Bedrock
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Retrieval of fully structured data made easy. Use LLMs or custom models. Specializes in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
Semantic extraction from conference proceedings.
This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.
Low-Cost LLM-Powered Data Processing with Theoretical Guarantees
Cutting-edge tool that unlocks the full potential of semantic chunking
DocGenius AI - Generative AI Chatbot for your Documents
Serverless Intelligent Document Processing (IDP) solution for invoice automation using Amazon Bedrock Data Automation. Features automated data extraction, annotation, and processing pipeline built with AWS services and CDK.
OnnxTR OCR plugin for Docling
Cutting-edge tool designed to intelligently segment text documents into optimally-sized chunks
Text line detection for Urdu OCR (UTRNet)
A Python package that converts table images into HTML format using an object detection model and OCR.
Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration in development environments.
A GenAI-powered Flask app that audits documents for GDPR/HIPAA compliance, using regex rules and the Anthropic Claude API to suggest remediations.