Commit 9fd034d

yuanjua and stevhliu authored
Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent e812935 commit 9fd034d

File tree

1 file changed: +8 -10 lines changed

docs/source/en/model_doc/mobilenet_v2.md

Lines changed: 8 additions & 10 deletions
@@ -81,25 +81,23 @@ print(f"The predicted class label is: {predicted_class_label}")
 </hfoption>
 </hfoptions>
 
-<!-- Quantization - Not applicable -->
-<!-- Attention Visualization - Not applicable for this model type -->
 
 ## Notes
 
-- **Checkpoint Naming:** Classification checkpoints often follow `mobilenet_v2_{depth_multiplier}_{resolution}`, like `mobilenet_v2_1.4_224`. Segmentation checkpoints (using DeepLabV3+ head) might have names like `deeplabv3_mobilenet_v2_{depth_multiplier}_{resolution}`.
-- **Variable Input Size:** Like V1, the model works with images of different sizes (minimum 32x32), handled by [`MobileNetV2ImageProcessor`].
-- **1001 Classes (Classification):** ImageNet-1k pretrained classification models output 1001 classes (index 0 is background).
+- Classification checkpoint names follow the pattern `mobilenet_v2_{depth_multiplier}_{resolution}`, like `mobilenet_v2_1.4_224`. `1.4` is the depth multiplier and `224` is the image resolution. Segmentation checkpoint names follow the pattern `deeplabv3_mobilenet_v2_{depth_multiplier}_{resolution}`.
+- While trained on images of a specific size, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV2ImageProcessor`] handles the necessary preprocessing.
+- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0).
 - The segmentation models use a [DeepLabV3+](https://huggingface.co/papers/1802.02611) head which is often pretrained on datasets like [PASCAL VOC](https://huggingface.co/datasets/merve/pascal-voc).
-- **Padding Differences:** Similar to V1, original TensorFlow checkpoints had dynamic padding. The HF PyTorch implementation uses static padding by default. Enable dynamic padding (TF behavior) via `tf_padding=True` in [`MobileNetV2Config`].
+- The original TensorFlow checkpoints determine the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV2Config`].
 ```python
 from transformers import MobileNetV2Config
 
-# Example: Load config with dynamic padding enabled
 config = MobileNetV2Config.from_pretrained("google/mobilenet_v2_1.4_224", tf_padding=True)
 ```
-- **Unsupported Features:**
-    - The HF implementation uses global average pooling, not the optional fixed 7x7 average pooling from the original paper.
-    - Extracting specific intermediate hidden states (e.g., from expansion layers 10/13) requires `output_hidden_states=True` (returning all states).
+- The Transformers implementation does not support the following features.
+    - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel.
+    - `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes.
+    - Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights.
 - For segmentation models, the final convolution layer of the backbone is computed even though the DeepLabV3+ head doesn't use it.
 
 ## MobileNetV2Config
