
Commit 43202ee

mreraser authored and stevhliu committed
Update model_doc/vit_mae.md
1 parent 069f798 commit 43202ee

File tree

1 file changed: +5 -14 lines changed


docs/source/en/model_doc/vit_mae.md

Lines changed: 5 additions & 14 deletions
@@ -28,6 +28,9 @@ rendered properly in your Markdown viewer.
 
 [ViTMAE](https://huggingface.co/papers/2111.06377) is a self-supervised vision model that is pretrained by masking large portions of an image (~75%). An encoder processes the visible image patches and a decoder reconstructs the missing pixels from the encoded patches and mask tokens. After pretraining, the encoder can be reused for downstream tasks like image classification or object detection — often outperforming models trained with supervised learning.
 
+<img src="https://user-images.githubusercontent.com/11435359/146857310-f258c86c-fde6-48e8-9cee-badd2b21bd2c.png"
+alt="drawing" width="600"/>
+
 You can find all the original ViTMAE checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=vit-mae) organization.
 
 > [!TIP]
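
The paragraph above describes the ~75% masking ratio used during pretraining. For reference, this ratio is exposed as the `mask_ratio` field on `ViTMAEConfig`; a minimal sketch, assuming the library default of 0.75 (the 0.6 override below is purely illustrative):

```python
from transformers import ViTMAEConfig, ViTMAEForPreTraining

# The masking ratio described in the docs is a config field (0.75 by default).
config = ViTMAEConfig()
print(config.mask_ratio)  # expected: 0.75

# It can be overridden when building a randomly initialized model for pretraining experiments.
model = ViTMAEForPreTraining(ViTMAEConfig(mask_ratio=0.6))
```
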
@@ -36,10 +39,6 @@ You can find all the original ViTMAE checkpoints under the [AI at Meta](https://
 The example below demonstrates how to reconstruct the missing pixels with the [`AutoModel`] class.
 
 <hfoptions id="usage">
-
-<!-- This model is not currently supported via pipeline. -->
-
-</hfoption>
 <hfoption id="AutoModel">
 
 ```python
@@ -52,7 +51,8 @@ url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/
 image = Image.open(requests.get(url, stream=True).raw)
 
 processor = ViTImageProcessor.from_pretrained("facebook/vit-mae-base")
-inputs = processor(image, return_tensors="pt").to("cuda")
+inputs = processor(image, return_tensors="pt")
+inputs = {k: v.to("cuda") for k, v in inputs.items()}
 
 model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base", attn_implementation="sdpa").to("cuda")
 with torch.no_grad():
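
The hunk above replaces a single `.to("cuda")` call on the processor output with a per-tensor move. A minimal sketch of the same pattern with a CPU fallback; the `device` variable and the dummy `pixel_values` tensor are illustrative and not part of the patch:

```python
import torch

# Pick a device, falling back to CPU when no GPU is available (illustrative, not in the patch).
device = "cuda" if torch.cuda.is_available() else "cpu"

# The image processor returns a mapping of tensors; mocked here with a dummy pixel_values batch.
inputs = {"pixel_values": torch.randn(1, 3, 224, 224)}

# Move every tensor in the mapping individually, mirroring the updated documentation snippet.
inputs = {k: v.to(device) for k, v in inputs.items()}
print({k: v.device for k, v in inputs.items()})
```
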
@@ -61,22 +61,13 @@ with torch.no_grad():
 reconstruction = outputs.logits
 ```
 
-</hfoption>
-<hfoption id="transformers-cli">
-
-<!-- This model is not currently supported via transformers-cli. -->
-
 </hfoption>
 </hfoptions>
 
 ## Notes
 - ViTMAE is typically used in two stages. Self-supervised pretraining with [`ViTMAEForPreTraining`], and then discarding the decoder and fine-tuning the encoder. After fine-tuning, the weights can be plugged into a model like [`ViTForImageClassification`].
 - Use [`ViTImageProcessor`] for input preparation.
 
-```python
-from transformers import ViTMAEModel
-model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", attn_implementation="sdpa", torch_dtype=torch.float16)
-...
 ## Resources
 
 - Refer to this [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb) to learn how to visualize the reconstructed pixels from [`ViTMAEForPreTraining`].
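
The Notes in this file describe a two-stage workflow: self-supervised pretraining with `ViTMAEForPreTraining`, then discarding the decoder and fine-tuning the encoder with a model like `ViTForImageClassification`. A minimal sketch of that second stage, assuming the encoder weights of the pretrained checkpoint load into the classification model (the decoder weights are dropped, the classification head is newly initialized, and `num_labels=10` is an illustrative value):

```python
from transformers import ViTForImageClassification, ViTImageProcessor

# Second stage from the Notes: reuse the pretrained ViTMAE encoder for image classification.
# Expect a warning about unused decoder weights and a newly initialized classifier head.
processor = ViTImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTForImageClassification.from_pretrained("facebook/vit-mae-base", num_labels=10)
```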
