
added mllama doc #37647


Closed · wants to merge 17 commits

Conversation


@Nikil-D-Gr8 Nikil-D-Gr8 commented Apr 21, 2025

What does this PR do?
As suggested in this issue #issue-2947704577, this PR updates the documentation of the Mllama model so that it is aligned with the standardized format for all the docs.
I worked on mllama and used AI for parts of it, so please let me know even if you need a complete rewrite.

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

Please let me know if there are any changes to be made, and do share references for those changes if you have any.
Documentation: @stevhliu


@github-actions github-actions bot marked this pull request as draft April 21, 2025 04:59
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.


@stevhliu stevhliu left a comment


Awesome, thanks for your contribution!

Let's also add an example with the AttentionMaskVisualizer
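For reference, a rough sketch of what that could look like, assuming the `AttentionMaskVisualizer` utility at `transformers.utils.attention_visualizer` that the other standardized model docs use:

```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

# Show how the attention mask treats the <|image|> placeholder in a prompt.
visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-11B-Vision-Instruct")
visualizer("<|image|> What is shown in this image?")
```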


Mllama has an extra token used as a placeholder for image positions in the text. This means the input ids and the input embedding layer have one extra token. However, since the weights for the input and output embeddings are not tied, the `lm_head` layer has one fewer token and will fail if you want to calculate loss on image tokens or apply some logit processors. If you are training, make sure to mask out the special `"<|image|>"` tokens in the `labels`, as the model should not be trained on predicting them.
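A minimal sketch of that masking step (not from this PR), assuming the checkpoint used elsewhere in this thread and that the tokenizer maps `"<|image|>"` to a single special token id:

```python
from transformers import AutoProcessor

# Replace every <|image|> placeholder id in the labels with -100, the ignore
# index of PyTorch's cross-entropy loss, so no loss is computed on image tokens.
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
image_token_id = processor.tokenizer.convert_tokens_to_ids("<|image|>")

input_ids = processor.tokenizer("<|image|>Describe the image.", return_tensors="pt").input_ids
labels = input_ids.clone()
labels[labels == image_token_id] = -100
```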
```python
from transformers import pipeline
```
Member

Let's use a real image for the example here:

```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="image-text-to-text",
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    device=0,
    torch_dtype=torch.bfloat16
)
messages = [
    [
        {
            "role": "user", 
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
]
pipeline(text=messages, return_full_text=False)
```

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
from transformers import AutoModelForCausalLM
```
Member

Use the BitsAndBytesConfig

```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration, AutoProcessor

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    device_map="auto", 
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    quantization_config=bnb_config
)
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

messages = [
    [
        {
            "role": "user", 
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")
output = model.generate(**inputs, max_new_tokens=25)
print(processor.decode(output[0]))
```

- When training, mask out the `<|image|>` tokens in labels
- For CUDA index errors during generation, expand the `lm_head`:

```python
Member

Indent this code block so it falls under the last list item
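For context, a rough sketch of one way to expand the `lm_head` by one token; this is not necessarily the exact snippet in the PR and assumes `PreTrainedModel`'s `get_output_embeddings`/`set_output_embeddings` methods and its private `_get_resized_lm_head` helper:

```python
import torch
from transformers import MllamaForConditionalGeneration

model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct", torch_dtype=torch.bfloat16
)

# Resize the output head to one extra row so the <|image|> token id has a logit.
old_lm_head = model.get_output_embeddings()
new_num_tokens = old_lm_head.weight.shape[0] + 1
resized_lm_head = model._get_resized_lm_head(old_lm_head, new_num_tokens=new_num_tokens)
resized_lm_head.requires_grad_(old_lm_head.weight.requires_grad)
model.set_output_embeddings(resized_lm_head)
```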

Member

Remember to indent here!

Member

Still not indented

Comment on lines 133 to 141
## MllamaForCausalLM

[[autodoc]] MllamaForCausalLM
- forward

## MllamaVisionModel

[[autodoc]] MllamaVisionModel
- forward
Member

Add these docstrings back!

Nikil-D-Gr8 and others added 8 commits April 22, 2025 11:00
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@Nikil-D-Gr8
Author

@stevhliu Thanks a lot for the help, mate. I used AI a bit here and there thinking it could do a better job, but it looks like it just gave you more work. I'm new to all this and will keep it organic from now on. Thanks, and let me know if I can make any more edits.

@Nikil-D-Gr8 Nikil-D-Gr8 marked this pull request as ready for review April 22, 2025 06:23

## Usage Example
For quantized inference, use `BitsAndBytesConfig`:
Member

Suggested change
For quantized inference, use `BitsAndBytesConfig`:
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.

Nikil-D-Gr8 and others added 5 commits April 23, 2025 11:12
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
```python
model_id = "meta-llama/Llama-3.2-11B-Vision"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)
```
<div class="flex justify-center">
Member

We can remove this image and replace it with the quantization example in this comment.

```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration, AutoProcessor

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    device_map="auto", 
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    quantization_config=bnb_config
)
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

messages = [
    [
        {
            "role": "user", 
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")
output = model.generate(**inputs, max_new_tokens=25)
print(processor.decode(output[0]))
```

Member

Separate from the AutoModel example and outside of the <hfoption> block, you should have a separate code example for quantization as shown in the code snippet above.

The image hasn't been removed yet

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration, AutoProcessor
```
Member

Suggested change
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration, AutoProcessor
from transformers import MllamaForConditionalGeneration, AutoProcessor

Member

The AutoModel example shouldn't show quantization usage, so it was fine the way it was before. I was just removing BitsAndBytesConfig from the import

- When training, mask out the `<|image|>` tokens in labels
- For CUDA index errors during generation, expand the `lm_head`:

```python
Member

Remember to indent here!

Nikil-D-Gr8 and others added 2 commits April 23, 2025 23:24
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@Nikil-D-Gr8
Author

Nikil-D-Gr8 commented Apr 23, 2025

is it fine now?
@stevhliu


Comment on lines +164 to +168





Member

No need to add all these extra lines at the end either
