The LLM Evaluation Framework
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
The official evaluation suite and dynamic data release for MixEval.
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
MIT-licensed framework for testing LLMs, RAG pipelines, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
Python SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications - Parea AI (YC S23)
A lightweight Python package for running quick, basic QA evaluations. It includes standardized QA and semantic evaluation metrics: exact match, F1 score, PEDANT semantic match, and transformer match, plus prompting and evaluation of black-box and open-source large language models. The package also supports prompting the OpenAI and Anthropic APIs (a generic exact-match/F1 sketch appears after this list).
Develop reliable AI apps
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
FM-Leaderboard-er lets you build a leaderboard to find the best LLM or prompt for your own business use case, based on your data, tasks, and prompts.
Realign is a testing and simulation framework for AI applications.
Multilingual Evaluation Toolkits
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
Hackable, simple LLM evals on preference datasets
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models (a generic confidence-estimation sketch appears after this list).
Tools for systematic large language model evaluations
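As an illustration of the standard string-overlap metrics that several of the QA evaluation packages above report (exact match and token-level F1), here is a minimal, generic sketch in the SQuAD style. It is not the API of any specific package listed here, and the normalization rules are an assumption.

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style; assumed rules)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
    print(round(f1_score("Paris, France", "Paris"), 2))     # 0.67
```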
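The confidence-estimation entry above does not specify a method. One common, generic approach is self-consistency: sample the model several times on the same prompt and treat agreement with the majority answer as a rough confidence score. The sketch below is a hypothetical illustration of that idea, not the listed project's implementation.

```python
from collections import Counter
from typing import Callable, List, Tuple

def self_consistency_confidence(
    samples: List[str],
    normalize: Callable[[str], str] = lambda s: s.strip().lower(),
) -> Tuple[str, float]:
    """Return the majority answer among sampled generations and the fraction
    of samples that agree with it, used as a rough confidence estimate."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(samples)

if __name__ == "__main__":
    # e.g. five stochastic generations collected for the same prompt
    generations = ["42", "42", "41", "42", "Forty-two"]
    answer, confidence = self_consistency_confidence(generations)
    print(answer, confidence)  # "42" 0.6
```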