Add support for nested images to LLava and VipLLava #35558

yonigozlan · 2025-01-07T23:37:58Z

What does this PR do?

This PR adds the functions make_flat_list_of_images, make_nested_list_of_images and make_batched_videos to image_utils, removing some unnecessarily duplicated code.
make_flat_list_of_images also replaces make_list_of_images in the clip, blip, and siglip image processors. This lets image-text-to-text models that use these image processors accept nested image inputs while preserving backward compatibility.

Partially addresses #34545

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@zucchini-nlp

HuggingFaceDocBuilderDev · 2025-01-08T00:16:20Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1 · 2025-01-08T14:03:13Z

Flagging this PR too - I made some changes to the Llava/Pixtral processing for nested images here, so there might be some conflicts! #34801

zucchini-nlp

Super cool, thanks for cleaning this up. Looks much better now

zucchini-nlp · 2025-01-08T18:21:51Z

src/transformers/image_utils.py

+    if is_valid_image(images):
+        output_images = [[images]]


Could it be that we get a 4D tensor as a batch of images?

Yes, you are right! I made some changes that should account for that.
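To make the 4D case in this thread concrete, here is a minimal NumPy-only sketch of nesting logic that accounts for it. The names nest_images and is_single_image are hypothetical stand-ins, not the functions merged in this PR, and treating a 4D array as several images of one sample is an assumption made for illustration.

```python
import numpy as np


def is_single_image(x):
    """Hypothetical stand-in for is_valid_image: a 3D array (H, W, C)."""
    return isinstance(x, np.ndarray) and x.ndim == 3


def nest_images(images):
    """Return images as a list of samples, each sample being a list of 3D images."""
    # A single 3D image becomes one sample holding one image.
    if is_single_image(images):
        return [[images]]
    # Assumption: a 4D array holds several images that belong to one sample.
    if isinstance(images, np.ndarray) and images.ndim == 4:
        return [list(images)]
    if isinstance(images, (list, tuple)) and len(images) > 0:
        # Already nested: one inner list per sample.
        if all(isinstance(sample, (list, tuple)) for sample in images):
            return [list(sample) for sample in images]
        # Flat list of images: one sample holding all of them.
        if all(is_single_image(image) for image in images):
            return [list(images)]
    raise ValueError("Expected an image, a 4D array, or a (nested) list of images.")
```

With this sketch, a 4D array of shape (2, H, W, C) yields one sample containing two 3D images instead of being wrapped as if it were a single image.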

yonigozlan · 2025-01-09T22:23:14Z

src/transformers/image_utils.py

@@ -209,6 +213,107 @@ def make_list_of_images(images, expected_ndims: int = 3) -> List[ImageInput]:
    )


+def make_flat_list_of_images(


This can also return a 4D array/tensor, since that is how it was originally implemented in the processors that use this function, so the name might be a bit misleading?

Is a 4D array really necessary in the processors where this is called? AFAIK we always iterate over each image, which means that in the end we process one 3D image at a time anyway.

In that case, we can return only an actual list of images.

Hmm, good point, and it will also be more aligned with make_list_of_images. I will make the change and check that it doesn't break anything. Thanks!
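The outcome of this thread, always returning a real list of 3D images, can be sketched as follows. This is a NumPy-only illustration under stated assumptions: flatten_images is a hypothetical name, and the actual make_flat_list_of_images also handles PIL images and tensors, which are left out here.

```python
import numpy as np


def flatten_images(images):
    """Return a flat list of 3D images from a single, batched, flat, or nested input."""
    if isinstance(images, np.ndarray):
        # A single image is wrapped in a list.
        if images.ndim == 3:
            return [images]
        # A 4D batch is split into its 3D images rather than returned as-is.
        if images.ndim == 4:
            return list(images)
        raise ValueError(f"Expected a 3D or 4D array, got {images.ndim} dimensions.")
    if isinstance(images, (list, tuple)):
        # Lists are flattened recursively, so nested lists and lists of
        # 4D batches both end up as one flat list of 3D images.
        flat = []
        for item in images:
            flat.extend(flatten_images(item))
        return flat
    raise ValueError("Expected an array or a (nested) list of arrays.")
```

Since callers iterate over images one at a time, splitting the 4D batch up front keeps the return type uniform.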

yonigozlan · 2025-01-10T15:48:57Z

Done! @ArthurZucker this is ready for review :)

yonigozlan · 2025-01-21T17:51:58Z

Hey @ArthurZucker, if you have some bandwidth for a review, this PR would be a nice step towards uniformizing how we handle image/video processing.

zucchini-nlp

@yonigozlan hey, I'm now working on chat templates, which will be unblocked by your PR. Can we also add video batch support, since it's only a few-line change?

There are more cases where one could also pass it as [[[frame, frame, frame]], [frame, frame, frame]], for example. But no one does it that way, nor do chat templates need that much nesting. The change I suggested might be enough.

zucchini-nlp · 2025-01-28T13:55:25Z

src/transformers/image_utils.py

+        list: A list of videos.
+    """
+    if isinstance(videos, (list, tuple)) and isinstance(videos[0], (list, tuple)) and is_valid_image(videos[0][0]):
+        return videos


Since we standardized images, let's do the same for videos. We need batched list support for all modalities, to keep chat templates happy for batch formatting.

Suggested change

-        return videos
+        # case 1: nested batch of videos so we flatten it
+        if not isinstance(videos[0][0], Image.Image) and videos[0][0].ndim==4:
+            videos = [video for batch_list in videos for video in batch_list]
+        # case 2: list of videos represented as list of video frames
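The two cases named in the suggestion can be illustrated with a small NumPy-only sketch. It assumes a video is either a 4D array (frames, H, W, C) or a list of 3D frames; batch_videos and is_frame are hypothetical names, and this is not the merged make_batched_videos.

```python
import numpy as np


def is_frame(x):
    """Hypothetical check: a single frame is a 3D array (H, W, C)."""
    return isinstance(x, np.ndarray) and x.ndim == 3


def batch_videos(videos):
    """Return a flat list of videos from a single, flat, or nested input."""
    # A single 4D array is one video.
    if isinstance(videos, np.ndarray) and videos.ndim == 4:
        return [videos]
    if isinstance(videos, (list, tuple)) and len(videos) > 0:
        first = videos[0]
        # A list of frames is a single video.
        if is_frame(first):
            return [list(videos)]
        # A list of 4D arrays is already a list of videos.
        if isinstance(first, np.ndarray) and first.ndim == 4:
            return list(videos)
        if isinstance(first, (list, tuple)) and len(first) > 0:
            # Case 2: list of videos, each represented as a list of frames.
            if is_frame(first[0]):
                return [list(video) for video in videos]
            # Case 1: nested batch of 4D videos, flattened by one level.
            if isinstance(first[0], np.ndarray) and first[0].ndim == 4:
                return [video for batch_list in videos for video in batch_list]
    raise ValueError("Could not interpret the input as one or more videos.")
```

A per-sample nesting such as [[video_a, video_b], [video_c]] then comes out as a flat list of three videos, the shape the review asks for to support batched chat-template formatting.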

ArthurZucker

Nice cleanup! 🤗

…xt-to-text-inputs-processing

* move make_flat_list_of_images and make_batched_videos to image_utils
* remove unnecessary is_vision_available
* move make_nested_list_of_images to image_utils
* fix fast pixtral image processor
* fix import mllama
* fix make_nested_list_of_images
* add tests
* convert 4d arrays/tensors to list
* add test_make_batched_videos
* add support nested batch of videos
* fix image processing qwen2vl

yonigozlan requested a review from zucchini-nlp January 8, 2025 15:03

zucchini-nlp approved these changes Jan 9, 2025

yonigozlan requested review from qubvel, molbap, Rocketknight1 and ArthurZucker as code owners January 9, 2025 17:05

yonigozlan force-pushed the uniformize-image-text-to-text-inputs-processing branch from 0319100 to 6f595da Compare January 9, 2025 17:10

yonigozlan commented Jan 9, 2025

yonigozlan added 9 commits January 14, 2025 20:17

move make_flat_list_of_images and make_batched_videos to image_utils

3e5d37c

remove unnecessary is_vision_available

a95e445

move make_nested_list_of_images to image_utils

423e9d4

fix fast pixtral image processor

948f93b

fix import mllama

30a2d54

fix make_nested_list_of_images

a4a90aa

add tests

7583418

convert 4d arrays/tensors to list

f0da40b

add test_make_batched_videos

77ed530

yonigozlan force-pushed the uniformize-image-text-to-text-inputs-processing branch from 9aeed52 to 77ed530 Compare January 14, 2025 20:17

qubvel removed their request for review January 20, 2025 18:26

yonigozlan removed request for molbap and Rocketknight1 January 21, 2025 17:50

zucchini-nlp reviewed Jan 28, 2025

add support nested batch of videos

d311b79

zucchini-nlp mentioned this pull request Jan 29, 2025

Chat template: update for processor #35953

Merged

ArthurZucker approved these changes Jan 30, 2025

Merge remote-tracking branch 'upstream/main' into uniformize-image-te…

4da4ca1

…xt-to-text-inputs-processing

yonigozlan and others added 2 commits January 30, 2025 20:47

fix image processing qwen2vl

8f1bc86

Merge branch 'main' into uniformize-image-text-to-text-inputs-processing

8346963

yonigozlan merged commit d7188ba into huggingface:main Jan 30, 2025
25 checks passed

hiyouga mentioned this pull request Mar 12, 2025

Support batch size > 1 image-text inference #36682

Open

5 tasks
