
Use vidore benchmark to monitor performance during training #195


Open · wants to merge 16 commits into main from vidore_eval_training

Conversation

@QuentinJGMace (Collaborator) commented Feb 14, 2025

Code to monitor real retrieval metrics on datasets (e.g. the ViDoRe benchmark) during training.

This feature is deactivated by default and is designed for power users.

To use it, simply add the following to your training config:

vidore_eval_frequency: 200 # frequency of the benchmark eval
eval_dataset_format: "qa" # format of the benchmark datasets (qa or beir)

An example can be found at scripts/configs/qwen2/train_colqwen2_model_eval_vidore.yaml
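
For reference, a rough sketch of how this could be wired up manually with a Hugging Face Trainer (the import path and constructor arguments are inferred from the diff excerpts in this PR and may not match the final implementation exactly):

    from colpali_engine.trainer.eval_utils import BenchmarkEvalCallback

    def add_vidore_eval_callback(trainer, processor, model, config):
        # Hypothetical wiring: attach the benchmark-eval callback to an existing
        # transformers Trainer, reusing values from the training config.
        trainer.add_callback(
            BenchmarkEvalCallback(
                processor=processor,
                model=model,
                eval_dataset_loader=config.eval_dataset_loader,
                batch_query=config.tr_args.per_device_eval_batch_size,
                batch_passage=config.tr_args.per_device_eval_batch_size,
                batch_score=256,
                eval_steps_frequency=config.vidore_eval_frequency,
                dataset_format=config.eval_dataset_format,
            )
        )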

@tonywu71 tonywu71 self-requested a review February 17, 2025 08:30
@tonywu71 tonywu71 added the enhancement New feature or request label Feb 17, 2025
@tonywu71 (Collaborator) commented

Recap from our conversation 👋🏼

Let's:

  • remove the legacy evaluation code
  • add an optional training arg run_vidore_evaluator: if False, do not add the custom callback
  • add optional training args vidore_eval_dataset_name and vidore_eval_collection_name (if both are fed, raise an error)
  • add an optional training arg to control how often the eval runs (e.g. once every 5 eval steps); see the sketch below.
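
A rough sketch of how these optional args could look (illustrative only, not the PR's actual config class; field names follow the list above):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class VidoreEvalArgs:
        # Hypothetical container for the optional training args listed above.
        run_vidore_evaluator: bool = False
        vidore_eval_dataset_name: Optional[str] = None
        vidore_eval_collection_name: Optional[str] = None
        vidore_eval_frequency: int = 5  # run the ViDoRe eval once every N eval steps

        def __post_init__(self):
            if self.vidore_eval_dataset_name and self.vidore_eval_collection_name:
                raise ValueError(
                    "Provide either vidore_eval_dataset_name or vidore_eval_collection_name, not both."
                )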

@tonywu71 (Collaborator) commented

@QuentinJGMace vidore-benchmark v5.0.0 has been released, don't forget to bump this dep in pyproject.toml 😉

@ManuelFay (Collaborator) commented

@QuentinJGMace @tonywu71 Any updates?

@tonywu71 tonywu71 force-pushed the vidore_eval_training branch from b4aba07 to dff8e6f on April 4, 2025 14:20
CHANGELOG.md Outdated
Comment on lines 10 to 20
### Changed

- Warn about evaluation being different from Vidore, and do not store results to prevent confusion.

Collaborator Author

Not true, update

@QuentinJGMace QuentinJGMace marked this pull request as ready for review April 17, 2025 14:55
@QuentinJGMace QuentinJGMace requested a review from tonywu71 April 17, 2025 14:55
@QuentinJGMace QuentinJGMace requested a review from Copilot April 17, 2025 14:56
@Copilot (Copilot AI) left a comment

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

colpali_engine/utils/dataset_transformation.py:229

  • The class is defined as TestSetFactoryBEIR, but the call refers to TestSetFactory, which could lead to a NameError. Please update the class name in the call to match TestSetFactoryBEIR.
ds = TestSetFactory("vidore/tabfquad_test_subsampled")()

colpali_engine/trainer/eval_utils.py:110

  • The attribute 'self.eval_collection' is not defined in this class. It looks like it should reference an existing attribute, possibly related to the evaluation dataset loader. Please verify and update the reference.
print(f"Error during benchmark evaluation on collection '{self.eval_collection}': {e}")

Comment on lines +19 to +38
METRICS_TO_TRACK = [
    "ndcg_at_1",
    "ndcg_at_3",
    "ndcg_at_5",
    "ndcg_at_10",
    "ndcg_at_50",
    "ndcg_at_100",
    "recall_at_1",
    "recall_at_3",
    "recall_at_5",
    "recall_at_10",
    "recall_at_50",
    "recall_at_100",
    "map_at_1",
    "map_at_3",
    "map_at_5",
    "map_at_10",
    "map_at_50",
    "map_at_100",
]
Collaborator Author

That is quite a lot to keep track of in a wandb window, especially on multiple datasets.

Which few should we keep?

Collaborator

I would pick nDCG@k and recall@k for $k \in \{1, 5, 10\}$ or $k \in \{1, 3, 5, 10\}$.
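
For example, a trimmed-down list along those lines (illustrative only):

    # Reduced set following the suggestion above (k in {1, 5, 10}).
    METRICS_TO_TRACK = [
        "ndcg_at_1",
        "ndcg_at_5",
        "ndcg_at_10",
        "recall_at_1",
        "recall_at_5",
        "recall_at_10",
    ]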

Comment on lines +109 to +110
except Exception as e:
    print(f"Error during benchmark evaluation on collection '{self.eval_collection}': {e}")
Collaborator Author

Update, not relevant anymore

@ManuelFay (Collaborator) left a comment

Very cool, thanks!
Let's wait for the updates in the Dataset code before merging this, so that we can adapt it. Super cool work, thanks!!

eval_dataset_loader=self.config.eval_dataset_loader,
batch_query=self.config.tr_args.per_device_eval_batch_size,
batch_passage=self.config.tr_args.per_device_eval_batch_size,
batch_score=4,
Collaborator

That's super low, you can probably push this to 256 at least

max_length (int): Maximum sequence length for inputs. Default: 256.
run_eval (bool): If True, runs evaluation. Default: True.
run_train (bool): If True, runs training. Default: True.
vidore_eval_frequency (int): Vidore evaluation frequency, must be a multiple of tr_args.eval_steps.
Collaborator

Maybe add an assert to guarantee this?
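
For instance, a standalone guard along these lines (a sketch only; the actual attribute names may differ):

    def check_vidore_eval_frequency(vidore_eval_frequency: int, eval_steps: int) -> None:
        # Hypothetical check mirroring the docstring constraint above.
        assert vidore_eval_frequency % eval_steps == 0, (
            f"vidore_eval_frequency ({vidore_eval_frequency}) must be a multiple "
            f"of tr_args.eval_steps ({eval_steps})"
        )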

from peft import PeftModel
from transformers import PreTrainedModel, TrainerControl, TrainerState, TrainingArguments
from transformers.integrations import WandbCallback
from vidore_benchmark.evaluation.vidore_evaluators import ViDoReEvaluatorBEIR, ViDoReEvaluatorQA
Collaborator

No circular imports anymore?

print(f"\n=== Running benchmark evaluation at global step {state.global_step} ===")

# Evaluate on a collection.
if self.eval_dataset_loader is not None:
Collaborator

Any reason for this to be None?

Collaborator Author

Not really; since this eval is deactivated by default, I guess a user who wants to use it should have specified eval datasets.
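
If we ever wanted to make that assumption explicit (hypothetical, not part of this PR), the callback could fail fast instead of silently skipping:

    def _require_eval_datasets(eval_dataset_loader):
        # Hypothetical guard: the callback is opt-in, so a missing loader is a config error.
        if eval_dataset_loader is None:
            raise ValueError(
                "Benchmark evaluation was enabled but no eval_dataset_loader was provided "
                "in the training config."
            )
        return eval_dataset_loader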


def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    if state.global_step % self.eval_steps_frequency != 0:
        self.counter_eval += 1
Collaborator

When do you use this?

Collaborator Author

No need, it's an artefact from a previous implementation.

@tonywu71 (Collaborator) left a comment

Very nice work overall, thanks a ton @QuentinJGMace! A few comments to address but otherwise LGTM :)

@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/).

### Added

- Add the possibility for a user to evaluate a model on retrieval datasets (e.g ViDoRe benchmark) during its training.
Collaborator

Suggested change
- Add the possibility for a user to evaluate a model on retrieval datasets (e.g ViDoRe benchmark) during its training.
- Add `BenchmarkEvalCallback` to evaluate a model on retrieval datasets (e.g. ViDoRe benchmark) during its training and display the metrics on Weights & Biases.

dataset_format: str = "beir",
):
"""
Callback to evaluate the model on a collection of datasets during training.
Collaborator

Add mention of Wandb in the docstring.
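
For instance, something along these lines (wording is a suggestion only):

    """
    Callback to evaluate the model on a collection of retrieval datasets
    (e.g. the ViDoRe benchmark) during training and log the tracked metrics
    to Weights & Biases.
    """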

metrics_collection[test_name] = {k: v for k, v in metrics.items() if k in METRICS_TO_TRACK}
print(f"Benchmark metrics for tests datasets at step {state.global_step}:")
print(metrics_collection)
print("logging metrics to wandb")
Collaborator

Suggested change
print("logging metrics to wandb")

)
metrics_collection[test_name] = {k: v for k, v in metrics.items() if k in METRICS_TO_TRACK}
print(f"Benchmark metrics for tests datasets at step {state.global_step}:")
print(metrics_collection)
Collaborator

How does this look? If it's not already formatted (since it's a dict), you can try pprint.pprint if needed!

Collaborator Author

I actually think we could remove the print entirely; it looks a bit messy during training.

@@ -210,6 +210,21 @@ def __call__(self, *args, **kwargs):
return dataset


class TestSetFactoryBEIR:
Collaborator

2 things here:

  1. Need a short docstring here
  2. It's not a factory per se (a factory is a design pattern that provides a way to create objects without specifying the exact class of object that will be created, which is not the case here). I think we should keep the implementation as simple as possible. Here is a recommendation below.
def load_beir_test_dataset(dataset_path: str, split: str = "test") -> Dict[str, Dataset]:
    return {
        "corpus": cast(Dataset, load_dataset(dataset_path, name="corpus", split=split)),
        "queries": cast(Dataset, load_dataset(dataset_path, name="queries", split=split)),
        "qrels": cast(Dataset, load_dataset(dataset_path, name="qrels", split=split)),
    }

Wdyt?
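
For instance (hypothetical usage; the dataset path is a placeholder, not a real repository name):

    # Illustrative call to the helper suggested above; point it at a BEIR-format dataset.
    beir_splits = load_beir_test_dataset("my-org/my_beir_dataset")
    print(beir_splits["corpus"], beir_splits["queries"], beir_splits["qrels"])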

def __init__(self, dataset_path):
    self.dataset_path = dataset_path

def __call__(self, *args, **kwargs):
Collaborator

2 questions here:

  1. Is there a reason for keeping *args, **kwargs?
  2. Can we expose split: str = "test" in the __call__ args?
Suggested change
def __call__(self, *args, **kwargs):
def __call__(self, split: str = "test"):

Labels: enhancement (New feature or request)

3 participants