Description
Describe the bug
Hi, I tried GRPO training with sequence parallelism on 4 GPUs. During training, the process terminates unexpectedly with the following warning:
WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
This issue only occurs when using sequence parallelism. When running multi-GPU training without sequence parallelism, the training completes without any such problem.
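For context, the warning is PyTorch's reminder that the default process group should be torn down explicitly before the process exits. Below is a minimal sketch of that pattern in a plain torch.distributed script; it is not ms-swift's actual shutdown path, just an illustration of the cleanup the warning expects:

```python
import torch.distributed as dist

def main():
    # Launchers such as torchrun set RANK / WORLD_SIZE / LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    try:
        ...  # training / rollout work
    finally:
        # Explicit teardown; omitting this is what produces the
        # "destroy_process_group() was not called before program exit" warning.
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```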
Additionally, a warning message that did not appear before now shows up when sequence parallelism is enabled:
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
This deprecation warning also does not appear when sequence parallelism is disabled.
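For context, this deprecation message comes from the transformers Trainer, whose tokenizer attribute was renamed to processing_class. A minimal sketch of the two accessors, assuming a standard Trainer instance (the model/tokenizer choice below is only illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

# Illustrative setup only; any causal LM / tokenizer pair works.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
trainer = Trainer(model=model, processing_class=tokenizer)

tok = trainer.tokenizer           # deprecated accessor, triggers the warning above
tok = trainer.processing_class    # current accessor
```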

Your hardware and system info
4 × A100 GPUs
ms-swift == 3.6.0
transformers == 4.51.3
vllm == 0.8.5.post1
trl == 0.18.1
torch == 2.6.9
Additional context
Command used:
NPROC_PER_NODE=4 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-7B \
--external_plugins MY_PATH \
--reward_funcs MY_REWARD \
--train_type full \
--loss_type bnpo \
--torch_dtype bfloat16 \
--dataset MY_PATH \
--max_length 512 \
--max_completion_length 512 \
--num_train_epochs 1 \
--seed 42 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-6 \
--temperature 0.9 \
--warmup_ratio 0.05 \
--max_grad_norm 0.2 \
--save_strategy steps \
--save_steps 250 \
--save_total_limit 20 \
--logging_steps 1 \
--dataloader_num_workers 4 \
--num_generations 8 \
--system 'You are a helpful assistant.' \
--deepspeed zero3_offload \
--log_completions true \
--report_to wandb \
--num_iterations 1 \
--use_hf 1 \
--split_dataset_ratio 0 \
--use_vllm true \
--vllm_mode colocate \
--vllm_gpu_memory_utilization 0.5 \
--vllm_max_model_len 512 \
--vllm_tensor_parallel_size 4 \
--attn_impl flash_attn \
--offload_optimizer true \
--offload_model true \
--sequence_parallel_size 4 \
--gc_collect_after_offload true \
--dataloader_drop_last true \
--sleep_level 1