
Use vidore benchmark to monitor performance during training #195


Open · wants to merge 16 commits into main from vidore_eval_training

Conversation

@QuentinJGMace (Collaborator) commented Feb 14, 2025

Code to monitor real retrieval metrics on datasets (e.g. the ViDoRe benchmark) during training.

This feature is deactivated by default and is designed for power users.

To use it, simply add the following to your training config:

vidore_eval_frequency: 200 # frequency of the benchmark eval
eval_dataset_format: "qa" # format of the benchmark datasets (qa or beir)

An example can be found at scripts/configs/qwen2/train_colqwen2_model_eval_vidore.yaml
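
For reference, a rough sketch of how this could be wired up manually with a Hugging Face Trainer (the import path and constructor arguments are inferred from the diff excerpts in this PR and may not match the final implementation exactly):

    from colpali_engine.trainer.eval_utils import BenchmarkEvalCallback

    def add_vidore_eval_callback(trainer, processor, model, config):
        # Hypothetical wiring: attach the benchmark-eval callback to an existing
        # transformers Trainer, reusing values from the training config.
        trainer.add_callback(
            BenchmarkEvalCallback(
                processor=processor,
                model=model,
                eval_dataset_loader=config.eval_dataset_loader,
                batch_query=config.tr_args.per_device_eval_batch_size,
                batch_passage=config.tr_args.per_device_eval_batch_size,
                batch_score=256,
                eval_steps_frequency=config.vidore_eval_frequency,
                dataset_format=config.eval_dataset_format,
            )
        )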

@tonywu71 tonywu71 self-requested a review February 17, 2025 08:30
@tonywu71 tonywu71 added the enhancement New feature or request label Feb 17, 2025
@tonywu71 (Collaborator) commented

Recap from our conversation 👋🏼

Let's:

  • remove the legacy evaluation code
  • add an optional training arg run_vidore_evaluator: if False, do not add the custom callback
  • add optional training args vidore_eval_dataset_name and vidore_eval_collection_name (if both are fed, raise an error)
  • add an optional training arg to control how often the eval runs (e.g. once every 5 eval steps); see the sketch below.
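
A rough sketch of how these optional args could look (illustrative only, not the PR's actual config class; field names follow the list above):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class VidoreEvalArgs:
        # Hypothetical container for the optional training args listed above.
        run_vidore_evaluator: bool = False
        vidore_eval_dataset_name: Optional[str] = None
        vidore_eval_collection_name: Optional[str] = None
        vidore_eval_frequency: int = 5  # run the ViDoRe eval once every N eval steps

        def __post_init__(self):
            if self.vidore_eval_dataset_name and self.vidore_eval_collection_name:
                raise ValueError(
                    "Provide either vidore_eval_dataset_name or vidore_eval_collection_name, not both."
                )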

@tonywu71 (Collaborator) commented

@QuentinJGMace vidore-benchmark v5.0.0 has been released, don't forget to bump this dep in pyproject.toml 😉

@ManuelFay (Collaborator) commented

@QuentinJGMace @tonywu71 Any updates?

@tonywu71 tonywu71 force-pushed the vidore_eval_training branch from b4aba07 to dff8e6f on April 4, 2025 14:20
CHANGELOG.md Outdated
Comment on lines 10 to 20
### Changed

- Warn about evaluation being different from Vidore, and do not store results to prevent confusion.

Collaborator Author

Not true, update

@QuentinJGMace QuentinJGMace marked this pull request as ready for review April 17, 2025 14:55
@QuentinJGMace QuentinJGMace requested a review from tonywu71 April 17, 2025 14:55
@QuentinJGMace QuentinJGMace requested a review from Copilot April 17, 2025 14:56
@Copilot (Copilot AI) left a comment

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

colpali_engine/utils/dataset_transformation.py:229

  • The class is defined as TestSetFactoryBEIR, but the call refers to TestSetFactory, which could lead to a NameError. Please update the class name in the call to match TestSetFactoryBEIR.
ds = TestSetFactory("vidore/tabfquad_test_subsampled")()

colpali_engine/trainer/eval_utils.py:110

  • The attribute 'self.eval_collection' is not defined in this class. It looks like it should reference an existing attribute, possibly related to the evaluation dataset loader. Please verify and update the reference.
print(f"Error during benchmark evaluation on collection '{self.eval_collection}': {e}")

Comment on lines +19 to +38
METRICS_TO_TRACK = [
    "ndcg_at_1",
    "ndcg_at_3",
    "ndcg_at_5",
    "ndcg_at_10",
    "ndcg_at_50",
    "ndcg_at_100",
    "recall_at_1",
    "recall_at_3",
    "recall_at_5",
    "recall_at_10",
    "recall_at_50",
    "recall_at_100",
    "map_at_1",
    "map_at_3",
    "map_at_5",
    "map_at_10",
    "map_at_50",
    "map_at_100",
]
Collaborator Author

That is quite a lot to keep track of in a wandb window, especially on multiple datasets.

Which few should we keep?

Collaborator

I would pick nDCG@k and recall@k for $k \in \{1, 5, 10\}$ or $k \in \{1, 3, 5, 10\}$.
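
For example, a trimmed-down list along those lines (illustrative only):

    # Reduced set following the suggestion above (k in {1, 5, 10}).
    METRICS_TO_TRACK = [
        "ndcg_at_1",
        "ndcg_at_5",
        "ndcg_at_10",
        "recall_at_1",
        "recall_at_5",
        "recall_at_10",
    ]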

Comment on lines +109 to +110
except Exception as e:
    print(f"Error during benchmark evaluation on collection '{self.eval_collection}': {e}")
Collaborator Author

Update, not relevant anymore

@ManuelFay (Collaborator) left a comment

Very cool, thanks!
Let's wait for the updates in the Dataset code before merging this, so that we can adapt it. Super cool work, thanks!!

eval_dataset_loader=self.config.eval_dataset_loader,
batch_query=self.config.tr_args.per_device_eval_batch_size,
batch_passage=self.config.tr_args.per_device_eval_batch_size,
batch_score=4,
Collaborator

That's super low, you can probably push this to 256 at least

max_length (int): Maximum sequence length for inputs. Default: 256.
run_eval (bool): If True, runs evaluation. Default: True.
run_train (bool): If True, runs training. Default: True.
vidore_eval_frequency (int): Vidore evaluation frequency, must be a multiple of tr_args.eval_steps.
Collaborator

Maybe add an assert to guarantee this?
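
For instance, a standalone guard along these lines (a sketch only; the actual attribute names may differ):

    def check_vidore_eval_frequency(vidore_eval_frequency: int, eval_steps: int) -> None:
        # Hypothetical check mirroring the docstring constraint above.
        assert vidore_eval_frequency % eval_steps == 0, (
            f"vidore_eval_frequency ({vidore_eval_frequency}) must be a multiple "
            f"of tr_args.eval_steps ({eval_steps})"
        )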

from peft import PeftModel
from transformers import PreTrainedModel, TrainerControl, TrainerState, TrainingArguments
from transformers.integrations import WandbCallback
from vidore_benchmark.evaluation.vidore_evaluators import ViDoReEvaluatorBEIR, ViDoReEvaluatorQA
Collaborator

No circular imports anymore?

print(f"\n=== Running benchmark evaluation at global step {state.global_step} ===")

# Evaluate on a collection.
if self.eval_dataset_loader is not None:
Collaborator

Any reason for this to be None?

Collaborator Author

Not really; since this eval is deactivated by default, I guess a user who wants to use it should have specified eval datasets.
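
If we ever wanted to make that assumption explicit (hypothetical, not part of this PR), the callback could fail fast instead of silently skipping:

    def _require_eval_datasets(eval_dataset_loader):
        # Hypothetical guard: the callback is opt-in, so a missing loader is a config error.
        if eval_dataset_loader is None:
            raise ValueError(
                "Benchmark evaluation was enabled but no eval_dataset_loader was provided "
                "in the training config."
            )
        return eval_dataset_loader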


def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    if state.global_step % self.eval_steps_frequency != 0:
        self.counter_eval += 1
Collaborator

When do you use this?

Collaborator Author

No need, it's an artefact from a previous implementation.

@tonywu71 (Collaborator) left a comment

Very nice work overall, thanks a ton @QuentinJGMace! A few comments to address but otherwise LGTM :)

@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/).

### Added

- Add the possibility for a user to evaluate a model on retrieval datasets (e.g ViDoRe benchmark) during its training.
Collaborator

Suggested change
- Add the possibility for a user to evaluate a model on retrieval datasets (e.g ViDoRe benchmark) during its training.
- Add `BenchmarkEvalCallback` to evaluate a model on retrieval datasets (e.g. ViDoRe benchmark) during its training and display the metrics on Weights & Biases.

dataset_format: str = "beir",
):
"""
Callback to evaluate the model on a collection of datasets during training.
Collaborator

Add mention of Wandb in the docstring.
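
For instance, something along these lines (wording is a suggestion only):

    """
    Callback to evaluate the model on a collection of retrieval datasets
    (e.g. the ViDoRe benchmark) during training and log the tracked metrics
    to Weights & Biases.
    """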

metrics_collection[test_name] = {k: v for k, v in metrics.items() if k in METRICS_TO_TRACK}
print(f"Benchmark metrics for tests datasets at step {state.global_step}:")
print(metrics_collection)
print("logging metrics to wandb")
Collaborator

Suggested change
print("logging metrics to wandb")

)
metrics_collection[test_name] = {k: v for k, v in metrics.items() if k in METRICS_TO_TRACK}
print(f"Benchmark metrics for tests datasets at step {state.global_step}:")
print(metrics_collection)
Collaborator

How does this look? If it's not already formatted (since it's a dict), you can try pprint.pprint if needed!

Collaborator Author

I actually think we could remove the print entirely; it looks a bit messy during training.

@@ -210,6 +210,21 @@ def __call__(self, *args, **kwargs):
return dataset


class TestSetFactoryBEIR:
Collaborator

2 things here:

  1. Need a short docstring here
  2. It's not a factory per se (a factory is a design pattern that provides a way to create objects without specifying the exact class of object that will be created, which is not the case here). I think we should keep the implementation as simple as possible. Here is a recommendation below.
def load_beir_test_dataset(dataset_path: str, split: str = "test") -> Dict[str, Dataset]:
    return {
        "corpus": cast(Dataset, load_dataset(dataset_path, name="corpus", split=split)),
        "queries": cast(Dataset, load_dataset(dataset_path, name="queries", split=split)),
        "qrels": cast(Dataset, load_dataset(dataset_path, name="qrels", split=split)),
    }

Wdyt?
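
For instance (hypothetical usage; the dataset path is a placeholder, not a real repository name):

    # Illustrative call to the helper suggested above; point it at a BEIR-format dataset.
    beir_splits = load_beir_test_dataset("my-org/my_beir_dataset")
    print(beir_splits["corpus"], beir_splits["queries"], beir_splits["qrels"])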

def __init__(self, dataset_path):
    self.dataset_path = dataset_path

def __call__(self, *args, **kwargs):
Collaborator

2 questions here:

  1. Is there a reason for keeping *args, **kwargs?
  2. Can we expose split: str = "test" in the __call__ args?
Suggested change
def __call__(self, *args, **kwargs):
def __call__(self, split: str = "test"):

Labels: enhancement (New feature or request)

3 participants