
Training Omni hangs and makes no progress #4651

Open
@fiona-lxd

Following the Omni training tutorial:
nproc_per_node=8

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
TRANSFORMERS_CACHE=/data/xiangxi/cache \
HF_HOME=/data/xiangxi/cache \
HUGGINGFACE_HUB_CACHE=/data/xiangxi/cache \
MODELSCOPE_CACHE=/data/xiangxi/cache/modelscope \
MASTER_ADDR=localhost \
MASTER_PORT=12355 \
NPROC_PER_NODE=$nproc_per_node \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
MAX_PIXELS=1003520 \
ENABLE_AUDIO_OUTPUT=0 \
swift sft \
    --model Qwen/Qwen2.5-Omni-7B \
    --dataset - \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 2e-5 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps $(expr 32 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 200 \
    --save_total_limit 10 \
    --logging_steps 4 \
    --max_length 2048 \
    --output_dir /data/xiangxi/tmp/Omni/output_img_2w_cmd_distill_6_9_tmp \
    --warmup_ratio 0.05 \
    --max_new_tokens 1 \
    --dataloader_num_workers 8 \
    --deepspeed zero2

With the training command above, the run hangs right after the model is loaded. The log printed at the point where it hangs is:
[INFO:swift] Successfully registered post_encode hook: ['Qwen2_5OmniForConditionalGeneration'].
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 4. Using DeepSpeed's value.
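
For reference, the "4" reported for the DeepSpeed config in this log line matches what the command computes with `$(expr 32 / $nproc_per_node)`. The sketch below repeats that arithmetic with the values from the command. The variable name `target_global_batch` is a label introduced here for the literal 32 and does not come from the tutorial; reading it as the intended samples per optimizer step is an assumption.

```shell
#!/usr/bin/env bash
# Values copied from the command in this issue.
nproc_per_node=8
per_device_train_batch_size=1
target_global_batch=32   # assumed meaning of the "32" in $(expr 32 / $nproc_per_node)

# Same integer division that `expr 32 / $nproc_per_node` performs.
grad_accum=$(( target_global_batch / nproc_per_node ))

# Samples consumed per optimizer step across all ranks.
effective_batch=$(( per_device_train_batch_size * nproc_per_node * grad_accum ))

echo "gradient_accumulation_steps=${grad_accum}"
echo "effective_batch=${effective_batch}"
```

With 8 processes this prints `gradient_accumulation_steps=4` and `effective_batch=32`, so the mismatch message at least reports the value that was passed on the command line.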

GPU usage is as follows:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:1A:00.0 Off |                  Off |
| 30%   34C    P8             21W /  300W |   10480MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:1B:00.0 Off |                  Off |
| 30%   51C    P2             98W /  300W |     448MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A6000               Off |   00000000:3D:00.0 Off |                  Off |
| 30%   45C    P2             97W /  300W |     432MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A6000               Off |   00000000:3E:00.0 Off |                  Off |
| 30%   52C    P2             98W /  300W |     448MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A6000               Off |   00000000:88:00.0 Off |                  Off |
| 30%   50C    P2            107W /  300W |     432MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A6000               Off |   00000000:89:00.0 Off |                  Off |
| 30%   53C    P2            100W /  300W |     448MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX A6000               Off |   00000000:B1:00.0 Off |                  Off |
| 30%   56C    P2            108W /  300W |     432MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX A6000               Off |   00000000:B2:00.0 Off |                  Off |
| 30%   47C    P2             99W /  300W |     448MiB /  49140MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
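
In the table above, GPU 0 holds about 10 GiB at 0% utilization, while GPUs 1 to 7 each hold under 0.5 GiB at 100% utilization. One possible reading is that those ranks are busy-waiting on a collective call before loading anything, but that is a guess from the table alone. The sketch below is a small helper for spotting this pattern in machine-readable `nvidia-smi` output. The function name `flag_waiting_gpus` and the 1024 MiB / 90% thresholds are assumptions chosen for illustration, and the sample rows are transcribed from the table.

```shell
#!/usr/bin/env bash
# Print the indices of GPUs that show high utilization with little memory in use.
# Input: CSV lines "index, memory.used, utilization.gpu", as produced by
#   nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits
# The thresholds (1024 MiB, 90 %) are arbitrary values for illustration.
flag_waiting_gpus() {
    awk -F', *' -v mem_max=1024 -v util_min=90 '
        ($2 + 0) < mem_max && ($3 + 0) >= util_min {
            out = out (out == "" ? "" : " ") $1
        }
        END { print out }
    '
}

# Sample rows transcribed from the nvidia-smi table in this issue.
sample='0, 10480, 0
1, 448, 100
2, 432, 100
3, 448, 100
4, 432, 100
5, 448, 100
6, 432, 100
7, 448, 100'

waiting=$(printf '%s\n' "$sample" | flag_waiting_gpus)
echo "GPUs busy with less than 1 GiB allocated: ${waiting}"
```

On a live machine the sample can be replaced by piping the `nvidia-smi --query-gpu=...` command from the comment directly into `flag_waiting_gpus`.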

How should this problem be resolved?
