
[BUG] Exllamav2 quickly devolves into endless repetition in versions newer than 0.2.8 #793

Open
@ZhenyaPav

Description


OS

Linux

GPU Library

AMD ROCm

Python version

3.12

Pytorch version

2.8.0

Model

dakkidaze/Cydonia-22B-v1.3-4.5bpw-h6-exl2

Describe the bug

Exllamav2 versions newer than 0.2.8 seem to start generating nonsense very quickly.

Example output (same model as above, identical request in both cases):
Old version (0.2.8):

Once upon a time, in a lush green meadow, there lived a curious little rabbit named Benny. Benny loved exploring the meadow, hopping from one patch of clover to another, nibbling on the sweet leaves. One sunny morning, as Benny was enjoying his breakfast, he noticed a sleek black cat lounging under a nearby tree. (it goes on to write several paragraphs of coherent text)

New version (0.3.1):

Once upon a time, a cat named Tom and a rabbit named named named named (continues to write the same word until the length limit is hit)

Regarding models: the bug seems to manifest differently depending on the model (I'm not sure whether it affects all models or only Mistral 22B). Mistral-22B-based finetunes (I tested several) start repeating the same word, while Llama 3 8B returns generally coherent text most of the time (with the appropriate instruct template). Rocinante 12B is also unaffected.

Reproduction steps

Clone TabbyAPI into 2 folders with different names.

Inside both, create a venv and clone exllamav2.

Install PyTorch via:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
(I tested on a system with ROCm 6.4 installed, but torch built for 6.3 appears to be compatible with 6.4; if my memory serves me well, this issue was also present on at least ROCm 6.3 and 6.2.4.)

In repo 1 (old):
For TabbyAPI, check out commit 3960612d38b231017cd72e5fd19db855fe3bd371.
In exllamav2, check out v0.2.8.
Build both via pip install . (exllamav2 first, then tabby).

In repo 2 (new):
Simply build both at the latest version via pip install . (exllamav2 first, then tabby).

Launch the old one, generate a response with some deterministic preset (I used SillyTavern), close.
Launch the new one, generate a response with the same prompt and preset.
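The "deterministic preset" comparison can be sketched as a plain HTTP request against TabbyAPI's OpenAI-compatible completions endpoint. The endpoint URL, port, prompt, and exact sampler field names below are my assumptions for illustration, not values taken from this report; the idea is simply that greedy sampling (temperature 0, top_k 1) makes the outputs of the two builds directly comparable:

```python
import json

# Assumed default TabbyAPI address and OpenAI-compatible endpoint path.
ENDPOINT = "http://localhost:5000/v1/completions"

# Hypothetical deterministic request: greedy sampling should produce the
# same completion on every run, so any divergence between the old and new
# exllamav2 builds points at the inference code, not the sampler.
payload = {
    "model": "dakkidaze/Cydonia-22B-v1.3-4.5bpw-h6-exl2",
    "prompt": "Write a short story about a cat and a rabbit.",  # example prompt
    "max_tokens": 300,
    "temperature": 0,  # greedy decoding
    "top_k": 1,
}

print(json.dumps(payload, indent=2))

# To actually send it (requires a running TabbyAPI server), e.g.:
#   import urllib.request
#   req = urllib.request.Request(ENDPOINT, json.dumps(payload).encode(),
#                                {"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```

Sending the same payload to both builds and diffing the responses makes the repetition failure easy to demonstrate.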

I have encountered this bug several times when trying to move to newer versions of exllamav2; 0.2.8 is the latest version I'm confident works fine.

Expected behavior

In both cases, a coherent result should be generated

Logs

No response

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
