
Update phi4_multimodal.md #38830

Open · wants to merge 8 commits into base: main

137 changes: 48 additions & 89 deletions docs/source/en/model_doc/phi4_multimodal.md
@@ -10,108 +10,67 @@ rendered properly in your Markdown viewer.
-->

# Phi4 Multimodal
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-EE4C2C?logo=pytorch&logoColor=white&style=flat">
</div>
</div>

## Overview

[Phi4 Multimodal](https://huggingface.co/papers/2503.01743) is a multimodal model capable of processing text, image, and audio inputs, or any combination of these. It features a mixture of LoRA adapters for handling the different input types, and each input is routed to the appropriate encoder.

Phi4 Multimodal is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs and generates text outputs, with a 128K token context length. It was enhanced with supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages supported by each modality are:

- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

You can find all the original Phi4 Multimodal checkpoints under the [Phi4](https://huggingface.co/collections/microsoft/phi-4-677e9380e514feb5577a40e4) collection.
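
Each modality is served by its own LoRA adapter shipped with the checkpoint, so the adapter matching the input type has to be loaded and activated. A minimal sketch of the adapter handling, mirroring the calls from the full example later on this page:

```python
from transformers import AutoModelForCausalLM

model_path = "microsoft/Phi-4-multimodal-instruct"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0", torch_dtype="auto")

# The vision and speech LoRA adapters live in subfolders of the checkpoint
model.load_adapter(model_path, adapter_name="vision", device_map="cuda:0", adapter_kwargs={"subfolder": "vision-lora"})
model.load_adapter(model_path, adapter_name="speech", device_map="cuda:0", adapter_kwargs={"subfolder": "speech-lora"})

# Activate the adapter that matches the current input modality
model.set_adapter("vision")   # for image inputs
# model.set_adapter("speech") # for audio inputs
```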

> [!TIP]
> This model was contributed by [Cyril Vallez](https://huggingface.co/cyrilvallez). The most recent modeling code can be found [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py).
>
> Click on Phi-4 Multimodal in the right sidebar for more examples of how to apply Phi-4 Multimodal to different tasks.

The examples below demonstrate how to use the model for inference with the [`Pipeline`] or the [`AutoModel`] class, depending on the input modality (text, image, or audio).

<hfoptions id="usage">
<hfoption id="Pipeline">

`Phi4-multimodal-instruct` can be found on the [Hugging Face Hub](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

```python
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/Phi-4-multimodal-instruct", torch_dtype="auto", device=0)

# Your input text prompt
prompt = "Explain the concept of multimodal AI in simple terms."

# Generate output
result = generator(prompt, max_length=50)
print(result[0]['generated_text'])
```

**Reviewer comment (Member):** @Cyrilvallez, this code example also returns the following error:

    AttributeError: 'LoraModel' object has no attribute 'prepare_inputs_for_generation'

    During handling of the above exception, another exception occurred:

    AttributeError                            Traceback (most recent call last)
    /usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
       1926             if name in modules:
       1927                 return modules[name]
    -> 1928         raise AttributeError(
       1929             f"'{type(self).__name__}' object has no attribute '{name}'"
       1930         )

    AttributeError: 'Phi4MMModel' object has no attribute 'prepare_inputs_for_generation'
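
Since the model also accepts images, a chat-style image prompt through a pipeline may work as well. This is only a sketch, assuming the checkpoint is usable with the `image-text-to-text` pipeline task (not verified here); the image URL and question are reused from the AutoModel example on this page:

```python
from transformers import pipeline

# Assumption: the checkpoint is registered for the image-text-to-text pipeline task
pipe = pipeline("image-text-to-text", model="microsoft/Phi-4-multimodal-instruct", torch_dtype="auto", device=0)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

result = pipe(text=messages, max_new_tokens=100)
print(result[0]["generated_text"])
```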

</hfoption>
<hfoption id="AutoModel">

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/Phi-4-multimodal-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-multimodal-instruct", torch_dtype=torch.bfloat16).to("cuda")

# Load image
image = Image.open("your_image.png")

# Prepare inputs
inputs = processor(text="Describe this image:", images=image, return_tensors="pt").to("cuda")

# Generate output
outputs = model.generate(**inputs, max_length=200)

# Decode output
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

**Reviewer comment (Member):** Hey @Cyrilvallez, when I try to run the code in the Usage tips, I get the following error. Would you mind taking a look please? 😄

    /usr/local/lib/python3.11/dist-packages/jinja2/environment.py in handle_exception(self, source)
        940         from .debug import rewrite_traceback_stack
        941
    --> 942         raise rewrite_traceback_stack(source=source)
        943
        944     def join_path(self, template: str, parent: str) -> str:

    <template> in top-level template code()

    TypeError: can only concatenate str (not "list") to str

For full multimodal inference, the more detailed example below loads the speech and vision LoRA adapters and runs image and audio inputs through the chat template.

```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig


# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
device = "cuda:0"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)

# Optional: load the adapters (note that without them, the base model will very likely not work well)
model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": 'speech-lora'})
model.load_adapter(model_path, adapter_name="vision", device_map=device, adapter_kwargs={"subfolder": 'vision-lora'})

# Part 1: Image Processing
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

model.set_adapter("vision")  # if loaded, activate the vision adapter
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')


# Part 2: Audio Processing
model.set_adapter("speech")  # if loaded, activate the speech adapter
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": audio_url},
            {"type": "text", "text": "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
```

</hfoption>
</hfoptions>

## Notes

**Reviewer comment (Member):** Add:

- The example below demonstrates inference with an audio and text input.

    ```py
    add the audio processing code example here
    ```

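One possible audio and text example for this Notes section, sketched from the adapter-based example earlier on this page (the model path, adapter subfolder, and audio URL are reused from there; the sketch has not been run here):

- The example below demonstrates inference with an audio and text input.

    ```py
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_path = "microsoft/Phi-4-multimodal-instruct"
    device = "cuda:0"

    processor = AutoProcessor.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)

    # Load and activate the speech adapter (shipped in the `speech-lora` subfolder of the checkpoint)
    model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": "speech-lora"})
    model.set_adapter("speech")

    audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "url": audio_url},
                {"type": "text", "text": "Transcribe the audio to text."},
            ],
        },
    ]

    # Build model inputs from the chat template
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(device)

    # Generate and decode only the newly generated tokens
    generate_ids = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(response)
    ```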

## Phi4MultimodalFeatureExtractor

[[autodoc]] Phi4MultimodalFeatureExtractor