Hacky tensor parallel for Vision Models #791
Draft
I was frustrated with TP not working on pixtral-large (it's slower than qwen235b) and messed around a little bit. The model still sees images and generates text, and no obvious side effects were observed, but I will test more, along with other architectures. Torch asserts because stuff is in inference mode, and this bypasses the check with a regular copy. Somehow it's also faster, by about 0.10 t/s.
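For anyone curious what the inference-mode assert looks like: below is a minimal sketch, not the actual patch, of the failure and the "regular copy" workaround described above. The tensor names are made up for illustration; the assumption is that the assert comes from an in-place write into a tensor that was allocated under `torch.inference_mode()`.

```python
import torch

# Anything allocated under torch.inference_mode() is an "inference tensor"
# (hypothetical stand-in for a weight shard touched by the TP code path).
with torch.inference_mode():
    weight_shard = torch.randn(4, 4)

src = torch.randn(4, 4)

# An in-place copy into an inference tensor outside inference mode trips
# torch's check with a RuntimeError.
try:
    weight_shard.copy_(src)
except RuntimeError as e:
    print("torch asserts:", e)

# Workaround: make a regular (non-inference) copy outside inference mode
# and mutate that instead. clone() outside inference mode produces a
# normal tensor with a version counter, so the in-place op is allowed.
regular_copy = weight_shard.clone()
regular_copy.copy_(src)                # no assert
print(regular_copy.is_inference())     # False
```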
I will also try it on qwen-VL at some point. Mainly this is here for anyone else who likes to chat with memes but wants higher speeds. I still have to test long context too; I only did a handful of images, so maybe at 32k ctx it blows up or goes OOM. Everyone feel free to tell me why this is a horrible idea :P
Update: I have used up to 20k context on pixtral and have also tested qwen2 VL 72b; it's working as well. 1MB images eat your context, who would have thought...