MTEB: Massive Text Embedding Benchmark
EMNLP 2023 Papers: Explore cutting-edge research from EMNLP 2023, the premier conference for advancing empirical methods in natural language processing. Stay updated on the latest in machine learning, deep learning, and natural language processing with code included. ⭐ support NLP!
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023
This repo supports various cross-lingual transfer learning & multilingual NLP models.
This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs" published in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL’23), July 9-14, 2023.
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages
Generate synthetic labeled data for extremely low-resource languages using bilingual lexicons.
🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
On Bilingual Lexicon Induction with Large Language Models (EMNLP 2023). Keywords: Bilingual Lexicon Induction, Word Translation, Large Language Models, LLMs.
[EMNLP 2022] Discovering Language-neutral Sub-networks in Multilingual Language Models.
MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing
MaLA-500: Massive Language Adaptation of Large Language Models
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis
[EMNLP 2023 - Findings] Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention
Self-Augmented In-Context Learning for Unsupervised Word Translation (ACL 2024). Keywords: Bilingual Lexicon Induction, Word Translation, Large Language Models, LLMs.
Official Repository for Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing (EMNLP 2024)
ConLID: Supervised Contrastive Learning for Low-Resource Language Identification [arXiv - 2025]
The official code and data for the ACL 2024 Findings paper "Bilingual Rhetorical Structure Parsing with Large Parallel Annotations".
This Python package is designed for tokenizing sentences in over 40 languages. It serves as a wrapper around various open-source libraries. The package was created to support our work XL-HeadTags. To use it, simply provide the word and its corresponding language to the stemmer, and it will return the stemmed version of the word.
This Python package is used for calculating ROUGE scores and supports over 100 languages by utilizing a multilingual BPE tokenizer. It leverages the mBERT tokenizer and was developed to support our work XL-HeadTags.