
Commit 43202ee

mreraser authored and stevhliu committed
Update model_doc/vit_mae.md
1 parent 069f798 commit 43202ee

File tree

1 file changed: +5 -14 lines changed


docs/source/en/model_doc/vit_mae.md

Lines changed: 5 additions & 14 deletions
@@ -28,6 +28,9 @@ rendered properly in your Markdown viewer.
 
 [ViTMAE](https://huggingface.co/papers/2111.06377) is a self-supervised vision model that is pretrained by masking large portions of an image (~75%). An encoder processes the visible image patches and a decoder reconstructs the missing pixels from the encoded patches and mask tokens. After pretraining, the encoder can be reused for downstream tasks like image classification or object detection — often outperforming models trained with supervised learning.
 
+<img src="https://user-images.githubusercontent.com/11435359/146857310-f258c86c-fde6-48e8-9cee-badd2b21bd2c.png"
+alt="drawing" width="600"/>
+
 You can find all the original ViTMAE checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=vit-mae) organization.
 
 > [!TIP]
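
The paragraph above describes the ~75% masking ratio used during pretraining. For reference, this ratio is exposed as the `mask_ratio` field on `ViTMAEConfig`; a minimal sketch, assuming the library default of 0.75 (the 0.6 override below is purely illustrative):

```python
from transformers import ViTMAEConfig, ViTMAEForPreTraining

# The masking ratio described in the docs is a config field (0.75 by default).
config = ViTMAEConfig()
print(config.mask_ratio)  # expected: 0.75

# It can be overridden when building a randomly initialized model for pretraining experiments.
model = ViTMAEForPreTraining(ViTMAEConfig(mask_ratio=0.6))
```
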
@@ -36,10 +39,6 @@ You can find all the original ViTMAE checkpoints under the [AI at Meta](https://
 The example below demonstrates how to reconstruct the missing pixels with the [`AutoModel`] class.
 
 <hfoptions id="usage">
-
-<!-- This model is not currently supported via pipeline. -->
-
-</hfoption>
 <hfoption id="AutoModel">
 
 ```python
@@ -52,7 +51,8 @@ url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/
 image = Image.open(requests.get(url, stream=True).raw)
 
 processor = ViTImageProcessor.from_pretrained("facebook/vit-mae-base")
-inputs = processor(image, return_tensors="pt").to("cuda")
+inputs = processor(image, return_tensors="pt")
+inputs = {k: v.to("cuda") for k, v in inputs.items()}
 
 model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base", attn_implementation="sdpa").to("cuda")
 with torch.no_grad():
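
The hunk above replaces a single `.to("cuda")` call on the processor output with a per-tensor move. A minimal sketch of the same pattern with a CPU fallback; the `device` variable and the dummy `pixel_values` tensor are illustrative and not part of the patch:

```python
import torch

# Pick a device, falling back to CPU when no GPU is available (illustrative, not in the patch).
device = "cuda" if torch.cuda.is_available() else "cpu"

# The image processor returns a mapping of tensors; mocked here with a dummy pixel_values batch.
inputs = {"pixel_values": torch.randn(1, 3, 224, 224)}

# Move every tensor in the mapping individually, mirroring the updated documentation snippet.
inputs = {k: v.to(device) for k, v in inputs.items()}
print({k: v.device for k, v in inputs.items()})
```
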
@@ -61,22 +61,13 @@ with torch.no_grad():
 reconstruction = outputs.logits
 ```
 
-</hfoption>
-<hfoption id="transformers-cli">
-
-<!-- This model is not currently supported via transformers-cli. -->
-
 </hfoption>
 </hfoptions>
 
 ## Notes
 - ViTMAE is typically used in two stages. Self-supervised pretraining with [`ViTMAEForPreTraining`], and then discarding the decoder and fine-tuning the encoder. After fine-tuning, the weights can be plugged into a model like [`ViTForImageClassification`].
 - Use [`ViTImageProcessor`] for input preparation.
 
-```python
-from transformers import ViTMAEModel
-model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", attn_implementation="sdpa", torch_dtype=torch.float16)
-...
 ## Resources
 
 - Refer to this [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb) to learn how to visualize the reconstructed pixels from [`ViTMAEForPreTraining`].
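
The Notes in this file describe a two-stage workflow: self-supervised pretraining with `ViTMAEForPreTraining`, then discarding the decoder and fine-tuning the encoder with a model like `ViTForImageClassification`. A minimal sketch of that second stage, assuming the encoder weights of the pretrained checkpoint load into the classification model (the decoder weights are dropped, the classification head is newly initialized, and `num_labels=10` is an illustrative value):

```python
from transformers import ViTForImageClassification, ViTImageProcessor

# Second stage from the Notes: reuse the pretrained ViTMAE encoder for image classification.
# Expect a warning about unused decoder weights and a newly initialized classifier head.
processor = ViTImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTForImageClassification.from_pretrained("facebook/vit-mae-base", num_labels=10)
```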
