Skip to content

Add support for nested images to LLava and VipLLava #35558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

yonigozlan
Copy link
Member

@yonigozlan yonigozlan commented Jan 7, 2025

What does this PR do?

This PR adds the functions make_flat_list_of_images , make_nested_list_of_images and make_batched_videos to image_utils, removing some unnecessarily duplicated code.
make_flat_list_of_images also replaces make_list_of_images in clip, blip, and siglip image processors, as it allows image-text-to-text models which use these image processors to support nested images inputs, while preserving BC.

Partially addresses #34545

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Copy link
Member

Flagging this PR too - I made some changes to the Llava/Pixtral processing for nested images here, so there might be some conflicts! #34801

Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super cool, thanks for cleaning this up. Looks much better now

Comment on lines 263 to 264
if is_valid_image(images):
output_images = [[images]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it be that we get a 4D tensor as a batch of images?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right! Made some changes that should account for that

@yonigozlan yonigozlan force-pushed the uniformize-image-text-to-text-inputs-processing branch from 0319100 to 6f595da Compare January 9, 2025 17:10
@@ -209,6 +213,107 @@ def make_list_of_images(images, expected_ndims: int = 3) -> List[ImageInput]:
)


def make_flat_list_of_images(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also return a 4d array/tensor, as that's how it was originally implemented in processors that use this function, so the name might be a bit misleading?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if a 4D array is really necessary in processors where it is called? AFAIK we always iterate over each image, which mean in the end we'll anyway process one 3D image

In that case, we can only return an actual list of images

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point, it will also be more aligned with make_list_of_images. I will make the change and check that it doesn't break anything. Thanks!

@yonigozlan
Copy link
Member Author

Done! @ArthurZucker this is ready for review :)

@yonigozlan yonigozlan force-pushed the uniformize-image-text-to-text-inputs-processing branch from 9aeed52 to 77ed530 Compare January 14, 2025 20:17
@qubvel qubvel removed their request for review January 20, 2025 18:26
@yonigozlan
Copy link
Member Author

Hey @ArthurZucker If you have some bandwidth for a review, this PR would be a nice step towards uniformizing how we handle image/video processing

Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yonigozlan hey, I'm working now on chat templates which will be unblocked by your PR. Can we also add video batch support, since it's only a few lines change

There are more cases when one can pass it also as [[[frame, frame, frame]], [frame, frame, frame]] for example. But no-one does it that way, neither chat templates need so much nesting. The change I suggested might be enough

list: A list of videos.
"""
if isinstance(videos, (list, tuple)) and isinstance(videos[0], (list, tuple)) and is_valid_image(videos[0][0]):
return videos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we standardizaed images, let's do same for videos. We need batched list support for all modalities, to make chat templates happy for batch formatting

Suggested change
return videos
# case 1: nested batch of videos so we flatten it
if not isinstance(videos[0][0], Image.Image) and videos[0][0].ndim==4:
videos = [video for batch_list in videos for video in batch_list]
# case 2: list of videos represented as list of video frames

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup! 🤗

@yonigozlan yonigozlan merged commit d7188ba into huggingface:main Jan 30, 2025
25 checks passed
yaswanth19 pushed a commit to yaswanth19/transformers that referenced this pull request Feb 5, 2025
* move make_flat_list_of_images and make_batched_videos to image_utils

* remove unnecessary is_vision_available

* move make_nested_list_of_images to image_utils

* fix fast pixtral image processor

* fix import mllama

* fix make_nested_list_of_images

* add tests

* convert 4d arrays/tensors to list

* add test_make_batched_videos

* add support nested batch of videos

* fix image processing qwen2vl
bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Feb 5, 2025
* move make_flat_list_of_images and make_batched_videos to image_utils

* remove unnecessary is_vision_available

* move make_nested_list_of_images to image_utils

* fix fast pixtral image processor

* fix import mllama

* fix make_nested_list_of_images

* add tests

* convert 4d arrays/tensors to list

* add test_make_batched_videos

* add support nested batch of videos

* fix image processing qwen2vl
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
* move make_flat_list_of_images and make_batched_videos to image_utils

* remove unnecessary is_vision_available

* move make_nested_list_of_images to image_utils

* fix fast pixtral image processor

* fix import mllama

* fix make_nested_list_of_images

* add tests

* convert 4d arrays/tensors to list

* add test_make_batched_videos

* add support nested batch of videos

* fix image processing qwen2vl
sbucaille pushed a commit to sbucaille/transformers that referenced this pull request Feb 16, 2025
* move make_flat_list_of_images and make_batched_videos to image_utils

* remove unnecessary is_vision_available

* move make_nested_list_of_images to image_utils

* fix fast pixtral image processor

* fix import mllama

* fix make_nested_list_of_images

* add tests

* convert 4d arrays/tensors to list

* add test_make_batched_videos

* add support nested batch of videos

* fix image processing qwen2vl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants