A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
[CVPR'25 highlight] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
Code for ACL 2023 Oral Paper: ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
The official implementation for the ICCV 2023 paper "Grounded Image Text Matching with Mismatched Relation Reasoning".
Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation (Published in IEEE TMM 2023)
Code for ECIR 2023 paper "Dialogue-to-Video Retrieval"
Streamlit App Combining Vision, Language, and Audio AI Models
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
Socratic models for multimodal reasoning & image captioning
Explore the rich flavors of Indian desserts with TunedLlavaDelights. Using LLaVA fine-tuning, our project unveils detailed nutritional profiles, taste notes, and optimal consumption times for beloved sweets. Dive into a fusion of AI innovation and culinary tradition.
This repository contains all the homeworks and projects from the Deep Learning course taught by Prof. Chinmay Hegde at NYU in Spring 2025.
Vision Matters explores how simple visual changes can enhance multimodal math reasoning. Join the discussion and contribute to the project! 👩‍💻👨‍💻