document-processing

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the visual template shared across those documents.

document-processing document-data-extraction

Updated May 29, 2025
Python

awslabs / rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

multi-modal document-processing generative-ai intelligent-document-processing amazon-bedrock

Updated Jun 19, 2025
Python

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

parsee-ai / parsee-core

Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files, with extensive support for tabular data extraction and multimodal queries.

structured-data document-processing multimodal llm

Updated May 22, 2025
Python

jmanhype / DSPy-Multi-Document-Agents

An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

nlp distributed-systems ai query-optimization knowledge-management document-processing vector-search

Updated Aug 17, 2024
Python

afrozas / proceedings

Semantic extraction from conference proceedings.

semantic conferences spacy document-processing

Updated Jul 26, 2020
Python

MBAigner / PDFSegmenter

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

python pdf csv table annotations cluster-analysis document-processing layout-analysis detection-model page-segmentation

Updated Sep 11, 2020
Python

ucbepic / BARGAIN

Low-Cost LLM-Powered Data Processing with Theoretical Guarantees

data ai document-processing llm

Updated May 1, 2025
Python

smart-models / Normalized-Semantic-Chunker

Cutting-edge tool that unlocks the full potential of semantic chunking

semantic rest-api embeddings gpu-acceleration semantic-search text-segmentation document-processing rag llm

Updated Jun 25, 2025
Python

thammuio / doc-genius-ai

DocGenius AI - Generative AI Chatbot for your Documents

machine-learning cloudera data-analytics cml ai-agents document-processing cloudera-machine-learning llm genai retrieval-augmented-generation genai-chatbot enterprise-ai-solutions

Updated Mar 22, 2025
Python

aws-samples / idp-invoice-automation-using-bedrock-data-automation-cdk

Serverless Intelligent Document Processing (IDP) solution for invoice automation using Amazon Bedrock Data Automation. Features automated data extraction, annotation, and processing pipeline built with AWS services and CDK.

python aws lambda s3 sqs idp document-processing aws-cdk event-bridge genai amazon-bedrock amazon-bedrock-data-automation

Updated Jan 15, 2025
Python

felixdittrich92 / docling-OCR-OnnxTR

OnnxTR OCR plugin for Docling

ocr deep-learning text-recognition text-detection document-processing onnx onnxruntime docling onnxtr

Updated Jun 19, 2025
Python

smart-models / Sentences-Chunker

Cutting-edge tool designed to intelligently segment text documents into optimally-sized chunks

nlp docker-compose gpu-acceleration document-processing rag fastapi text-chunking

Updated Jun 9, 2025
Python

abdur75648 / urdu-text-detection

Text line detection for Urdu OCR (UTRNet)

ocr text-detection document-processing urdu-text-detection urdu-ocr utrnet contournet

Updated Oct 8, 2024
Python

jayllfpt / table2html

A Python package that converts table images into HTML using an object detection model and OCR.

python opencv object-detection document-processing table-recognition table-detection table-structure-recognition

Updated Dec 3, 2024
Python

martin-papy / qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration in development environments.

python openai developer-tools knowledge-base file-conversion enterprise-ready semantic-search multi-project cli-tool document-processing embbedings git-integration rag jira-integration cursor-ide llm-integration mcp-server confluence-integration

Updated Jun 24, 2025
Python

sylvester-francis / Automated-Document-Compliance-Auditor

A GenAI-powered Flask app that audits documents for GDPR/HIPAA compliance, using regex rules and Anthropic Claude API to suggest remediations.

Updated May 10, 2025
Python

Here are 75 public repositories matching this topic...

ucbepic / docetl

enoch3712 / ExtractThinker

dhlab-epfl / dhSegment