Description
Following the Omni training tutorial, I launched training with the command below:
nproc_per_node=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
TRANSFORMERS_CACHE=/data/xiangxi/cache \
HF_HOME=/data/xiangxi/cache \
HUGGINGFACE_HUB_CACHE=/data/xiangxi/cache \
MODELSCOPE_CACHE=/data/xiangxi/cache/modelscope \
MASTER_ADDR=localhost \
MASTER_PORT=12355 \
NPROC_PER_NODE=$nproc_per_node \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
MAX_PIXELS=1003520 \
ENABLE_AUDIO_OUTPUT=0 \
swift sft \
    --model Qwen/Qwen2.5-Omni-7B \
    --dataset - \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 2e-5 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps $(expr 32 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 200 \
    --save_total_limit 10 \
    --logging_steps 4 \
    --max_length 2048 \
    --output_dir /data/xiangxi/tmp/Omni/output_img_2w_cmd_distill_6_9_tmp \
    --warmup_ratio 0.05 \
    --max_new_tokens 1 \
    --dataloader_num_workers 8 \
    --deepspeed zero2
With the command above, training hangs right after the model finishes loading. The last log lines printed before the hang are:
[INFO:swift] Successfully registered post_encode hook: ['Qwen2_5OmniForConditionalGeneration'].
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 4. Using DeepSpeed's value.
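For reference, the DeepSpeed value of 4 in the warning matches what the launch command computes for `--gradient_accumulation_steps` via `$(expr 32 / $nproc_per_node)`. The sketch below only reproduces that arithmetic; `grad_accum`, `per_device_train_batch_size`, and `effective_batch` are illustrative variable names, not part of the original command.

```shell
# Values taken from the launch command above.
nproc_per_node=8
per_device_train_batch_size=1

# Same expression as passed to --gradient_accumulation_steps.
grad_accum=$(expr 32 / $nproc_per_node)

# Samples contributing to one optimizer step across all ranks.
effective_batch=$((per_device_train_batch_size * grad_accum * nproc_per_node))

echo "gradient_accumulation_steps=$grad_accum"
echo "effective_batch_size=$effective_batch"
```

So the accumulation setting itself looks as intended; the open question is the hang that follows this message.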
GPU usage while the job is hung:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:1A:00.0 Off |                  Off |
| 30%   34C    P8             21W /  300W |   10480MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:1B:00.0 Off |                  Off |
| 30%   51C    P2             98W /  300W |     448MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A6000               Off |   00000000:3D:00.0 Off |                  Off |
| 30%   45C    P2             97W /  300W |     432MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A6000               Off |   00000000:3E:00.0 Off |                  Off |
| 30%   52C    P2             98W /  300W |     448MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A6000               Off |   00000000:88:00.0 Off |                  Off |
| 30%   50C    P2            107W /  300W |     432MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A6000               Off |   00000000:89:00.0 Off |                  Off |
| 30%   53C    P2            100W /  300W |     448MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX A6000               Off |   00000000:B1:00.0 Off |                  Off |
| 30%   56C    P2            108W /  300W |     432MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX A6000               Off |   00000000:B2:00.0 Off |                  Off |
| 30%   47C    P2             99W /  300W |     448MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
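The table above shows an asymmetric state: GPU 0 holds 10480 MiB at 0% utilization, while GPUs 1 to 7 each hold under 450 MiB at 100% utilization. Below is a small sketch for capturing the same summary in a compact form. The helper name `flag_busy_low_memory_gpus` and its default threshold of 1024 MiB are my own illustrative choices, and the sample rows are copied from the table.

```shell
# Hypothetical helper: reads CSV rows of "index, memory.used (MiB), utilization (%)"
# and prints GPUs that are at >= 90% utilization while using less memory than the
# given limit (default 1024 MiB, an arbitrary illustrative threshold).
flag_busy_low_memory_gpus() {
  awk -F', *' -v limit_mib="${1:-1024}" '
    ($3 + 0) >= 90 && ($2 + 0) < (limit_mib + 0) {
      printf "GPU %s: %s MiB used at %s%% utilization\n", $1, $2, $3
    }
  '
}

# Sample rows copied from the nvidia-smi table above.
sample='0, 10480, 0
1, 448, 100
2, 432, 100
3, 448, 100
4, 432, 100
5, 448, 100
6, 432, 100
7, 448, 100'

printf '%s\n' "$sample" | flag_busy_low_memory_gpus
```

On a live machine the same rows can be produced with `nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits` and piped into the helper.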
How can this hang be resolved?