Update phi4_multimodal.md #38830
Open: Tanuj-rai wants to merge 8 commits into `huggingface:main` from `Tanuj-rai:update-phi4-multimodal-card`.
Commits:

- `860a257` Update phi4_multimodal.md
- `7fae112` Merge branch 'main' into update-phi4-multimodal-card
- `96219c5` Update docs/source/en/model_doc/phi4_multimodal.md
- `826249d` Update docs/source/en/model_doc/phi4_multimodal.md
- `4033187` Update docs/source/en/model_doc/phi4_multimodal.md
- `6fcb264` Update docs/source/en/model_doc/phi4_multimodal.md
- `aef927a` Update docs/source/en/model_doc/phi4_multimodal.md
- `a307a02` Update phi4_multimodal.md
<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-EE4C2C?logo=pytorch&logoColor=white&style=flat">
    </div>
</div>

# Phi4 Multimodal

[Phi4 Multimodal](https://huggingface.co/papers/2503.01743) is a lightweight open multimodal foundation model that accepts text, image, and audio inputs, or any combination of them, and generates text outputs with a 128K token context length. It features a mixture of LoRA adapters for the different input modalities, and each input is routed to the appropriate encoder. The model was post-trained with supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) for precise instruction adherence and safety. Each modality supports the following languages:

- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

You can find all the original Phi4 Multimodal checkpoints under the [Phi4](https://huggingface.co/collections/microsoft/phi-4-677e9380e514feb5577a40e4) collection.
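Inputs for the different modalities are passed to the processor's chat template as typed content entries, which is how each input gets routed to its encoder. The sketch below is plain Python (no model download required) and only illustrates the message structure used throughout this page; the URLs are examples, not requirements:

```python
# Sketch of the chat message structure the processor's chat template expects.
# Each content entry is typed ("text", "image", or "audio") so the processor
# can route it to the matching encoder. The URLs are illustrative only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "audio", "url": "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"},
            {"type": "text", "text": "Describe the image, then transcribe the audio."},
        ],
    },
]

# The set of modalities present tells you which LoRA adapter(s) to activate.
modalities = {entry["type"] for entry in messages[0]["content"]}
print(sorted(modalities))  # ['audio', 'image', 'text']
```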

> [!TIP]
> This model was contributed by [cyrilvallez](https://huggingface.co/cyrilvallez). The most recent code can be found [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py).
>
> Click on Phi-4 Multimodal in the right sidebar for more examples of how to apply it to different tasks.

The examples below demonstrate how to generate text with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-multimodal-instruct",
    torch_dtype="auto",
    device=0,
)

# Text-only prompt
prompt = "Explain the concept of multimodal AI in simple terms."

# Generate output
result = generator(prompt, max_new_tokens=50)
print(result[0]["generated_text"])
```
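The pipeline above is text-only. For image inputs, recent transformers releases also expose an `image-text-to-text` pipeline task that accepts chat-style messages; the sketch below is a hedged assumption (not verified against this checkpoint) and wraps the call in a function because constructing the pipeline downloads the full model weights:

```python
# Hypothetical sketch: image + text generation through the pipeline API.
# Wrapped in a function so merely defining it does not download any weights.
def describe_image(image_url: str, question: str, max_new_tokens: int = 100):
    from transformers import pipeline  # heavyweight import kept local on purpose

    # "image-text-to-text" is the multimodal chat pipeline task in recent
    # transformers releases; treat its use with this checkpoint as an assumption.
    pipe = pipeline(
        "image-text-to-text",
        model="microsoft/Phi-4-multimodal-instruct",
        torch_dtype="auto",
        device=0,
    )
    messages = [
        {"role": "user", "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ]},
    ]
    return pipe(messages, max_new_tokens=max_new_tokens)

# Example call (downloads the model; requires a GPU for reasonable speed):
# describe_image("https://www.ilankelman.org/stopsigns/australia.jpg", "What is shown?")
```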

</hfoption>
<hfoption id="AutoModel">

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "microsoft/Phi-4-multimodal-instruct"
device = "cuda:0"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)

# Optional: load the adapters (note that without them, the base model will very likely not work well)
model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": "speech-lora"})
model.load_adapter(model_path, adapter_name="vision", device_map=device, adapter_kwargs={"subfolder": "vision-lora"})

# Part 1: image processing
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

model.set_adapter("vision")  # if loaded, activate the vision adapter
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

# Generate response and strip the prompt tokens before decoding
generate_ids = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f">>> Response\n{response}")

# Part 2: audio processing
model.set_adapter("speech")  # if loaded, activate the speech adapter
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": audio_url},
            {"type": "text", "text": "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

generate_ids = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f">>> Response\n{response}")
```

</hfoption>
</hfoptions>

## Notes

- The base model alone will very likely not perform well on image or audio inputs. Load the `vision-lora` and `speech-lora` adapters with `load_adapter` and activate the appropriate one with `set_adapter` before generating, as shown in the example above.

## Phi4MultimodalFeatureExtractor

[[autodoc]] Phi4MultimodalFeatureExtractor
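Feature extractors in transformers consume raw mono audio as a float sequence together with its sampling rate; 16 kHz is assumed below, so treat that rate (and the use of `AutoFeatureExtractor` with this checkpoint) as assumptions rather than documented facts. The waveform synthesis is stdlib-only; the Hub call is kept inside a function because it fetches files over the network:

```python
import math

SAMPLE_RATE = 16_000  # assumed sampling rate for the speech encoder

def make_sine_wave(freq_hz: float = 440.0, seconds: float = 1.0) -> list[float]:
    """Synthesize a mono sine wave as a plain list of floats in [-1, 1]."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]

def extract_features(waveform: list[float]):
    # Fetches the extractor config from the Hub; requires network access.
    from transformers import AutoFeatureExtractor

    extractor = AutoFeatureExtractor.from_pretrained("microsoft/Phi-4-multimodal-instruct")
    return extractor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")

audio = make_sine_wave()
print(len(audio))  # 16000
```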