
model : add hunyuan moe #14425


Draft · wants to merge 8 commits into master

Conversation

@ngxson (Collaborator) commented Jun 27, 2025

Fix #14415

STILL WIP

TODO:

  • convertible to GGUF
  • correct tokenizer / pretok
  • implement cgraph
  • profit

@github-actions github-actions bot added the python python script changes label Jun 27, 2025
Comment on lines +6418 to +6424
for token, rank in mergeable_ranks.items():
vocab[QwenModel.token_bytes_to_string(token)] = rank
if len(token) == 1:
continue
merged = QwenModel.bpe(mergeable_ranks, token, max_rank=rank)
if len(merged) == 2:
merges.append(' '.join(map(QwenModel.token_bytes_to_string, merged)))
@ngxson (Collaborator, Author) commented:

I quite doubt this is correct. If someone knows, or has time to run a tokenizer test, please feel free to leave a comment.

@ngxson (Collaborator, Author) commented Jun 27, 2025

Ok, getting somewhere now. The model runs, but outputs gibberish:

[UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧]

@ubergarm commented:

Thanks for working on this!

I got the same looking output trying llama-server on ngxson/xsn/hunyuan-moe@51886a47a with the freshly converted bf16.

The only odd things I noticed were:

  1. I had to pip install tiktoken to get it to convert
  2. Conversion had an odd warning WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.
  3. On startup llama-server printed this warning:
load: control-looking token: 127957 '<|endoftext|>' was not control-type; this is probably a bug in the model. its type will be overridden
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect

Tested on an AMD 7965WX 24x Core 256GB DDR5@4800 + Dual RTX A6000 (96GB Total VRAM) rig.

👈 a few more commands and logs fwiw

convert

python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/ \
    /mnt/raid/models/tencent/Hunyuan-A13B-Instruct/

...

llama-server

model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf

./build/bin/llama-server \
  --model "$model" \
  -fa \
  -ctk f16 -ctv f16 \
  -c 8192 \
  -ts 48,48 \
  -ngl 10 \
  --threads 24 \
  --host 127.0.0.1 \
  --port 8080

...

client

>>> User:

Tell a funny joke in English.

>>> Assistant:

[UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧]

@arch-btw (Contributor) commented Jun 27, 2025

I don't know as much about this as you guys, but could it be that the tokenizer is splitting characters like 新 ("new") into raw bytes?

So the UTF-8 sequence 0xe696b0 becomes 3 separate bytes (e6, 96, b0). And the other character 旧 ("old") splits into 3 bytes as well (e6, 97, a7).

And so the fragments get wrapped in [UNK_BYTE_] prefixes. The token stream becomes corrupt in the output and sort of traps the model in a "new --> old" loop, which then blocks normal text generation?

Because common Chinese characters always use 3 bytes in UTF-8:

  • 新 converts to b'\xe6\x96\xb0' (3 bytes)
  • 旧 converts to b'\xe6\x97\xa7' (3 bytes)

It matches the error: [UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧]
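For reference, a quick Python check of those byte sequences (not from the thread, just a sanity check):

# Verify the UTF-8 encodings behind the UNK_BYTE fragments above.
for ch in "新旧":
    b = ch.encode("utf-8")
    print(ch, b, b.hex())
# 新 b'\xe6\x96\xb0' -> e696b0
# 旧 b'\xe6\x97\xa7' -> e697a7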

@ngxson (Collaborator, Author) commented Jun 27, 2025

The cgraph is still not correct. Testing with this tiny random weight: https://huggingface.co/ngxson/hunyuan-moe-tiny-random/tree/main

Seems like the problem is in the self-attention block.

@kooshi commented Jun 28, 2025

I don't know if the improvements I am seeing are from your last wip commit, or from my edits to the convert script, but I currently get almost intelligible responses.

The changes I made were (a rough sketch follows below):

  • specify the BOS token explicitly, as it is incorrect in Hunyuan's config.json: self.gguf_writer.add_bos_token_id(127959)
  • use tokenizer.special_tokens.values() instead of tokenizer.get_added_vocab() to determine control tokens
  • skip lm_head.weight, as the embedding weights are tied
  • change the base model from LlamaModel to TextModel for a more generic foundation
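A hedged sketch of how those changes might look inside convert_hf_to_gguf.py; class and arch names here are placeholders, and the real edits are in the branch linked below:

# Illustrative sketch only, not the actual patch. HunYuanMoEModel and
# MODEL_ARCH.HUNYUAN_MOE are placeholder names; TextModel, gguf_writer and
# map_tensor_name follow the existing conventions in convert_hf_to_gguf.py.
class HunYuanMoEModel(TextModel):              # previously based on LlamaModel
    model_arch = gguf.MODEL_ARCH.HUNYUAN_MOE   # hypothetical enum name

    def set_vocab(self):
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
        # control tokens come from the tiktoken-style special_tokens map,
        # not from tokenizer.get_added_vocab()
        special_ids = set(tokenizer.special_tokens.values())
        # ... build vocab / merges here ...
        # BOS in Hunyuan's config.json is wrong, so pin it explicitly
        self.gguf_writer.add_bos_token_id(127959)

    def modify_tensors(self, data_torch, name, bid):
        if name == "lm_head.weight":
            return []                          # embeddings are tied; skip the head
        return [(self.map_tensor_name(name), data_torch)]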

my edits are here: https://github.com/kooshi/llama.cpp/tree/hunyuan
Full disclaimer: I have no idea what I'm doing. The BOS token was definitely broken, though.

> hello
<think>[UNK_BYTE_0x0a>
]Okay,[UNK_BYTE_0x20 the]the[UNK_BYTE_0x20 user]user[UNK_BYTE_0x20 said]said[UNK_BYTE_0x20 "]"hello".[UNK_BYTE_0x20 I]I[UNK_BYTE_0x20 need]need[UNK_BYTE_0x20 to]to[UNK_BYTE_0x20 respond]respond[UNK_BYTE_0x20 appropriately]appropriately.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]First,[UNK_BYTE_0x20 hello]hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello[UNK_BYTE_0x20 there]there![UNK_BYTE_0x0a!

][UNK_BYTE_0x0a!

]Hi[UNK_BYTE_0x20 there]there.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hi.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hi.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hey.[UNK_BYTE_0x0a.

(continues forever)

@ngxson (Collaborator, Author) commented Jun 28, 2025

The more I look at the upstream implementation, the more I wonder whether it actually works.

My Mac M3 Ultra can't load the original model even though it has 512GB of RAM.

Now, testing with the tiny weights: switching between eager and sdpa gives different output results, which indicates that one of the two attention implementations is buggy.

Also, flash_attn does not work at all; they haven't even verified that code path before shipping (NameError: name 'flash_attn_func' is not defined).

And more importantly, attention_mask is None everywhere, even when using the example code provided on HF.

If that is true, it means they messed up badly this time.

@Downtown-Case commented:

modeling_hunyuan.py is basically identical to the file for the old hunyuan-large, with 1 changed line:

https://www.diffchecker.com/P3e0hQM5/

https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Instruct/

And hunyuan.py (the actual model class here) is largely copied from modeling_hunyuan.py, including unused features like CLA:

https://www.diffchecker.com/P9FIR5OD/

In other words, it's almost Hunyuan-Large? I'm not sure why the HF attention implementations would be bugged. But other reimplementations like vLLM's seem to work, so maybe they can shed some light on this:

quinnrong94/vllm@5302fbf

@Downtown-Case commented Jun 28, 2025

I take that back; apparently vLLM only sometimes works with A13B, heh:

ikawrakow/ik_llama.cpp#561 (comment)

vllm-project/vllm#20183

vllm-project/vllm#20114

@Noeda (Contributor) commented Jun 28, 2025

I got the original model from Hugging Face working coherently on pure CPU. It uses the HunYuanSdpaAttention code path.

This is all tentative as I just got it running at all:

If I compare logits for a single-token prompt, I get a very similar logit distribution from llama.cpp and HF. With more than one token, things look different. I'm using raw numerical token IDs for llama.cpp, since the tokenizer is messed up as observed (I tried 'a', token 64, for the single-token prompt, and '12', tokens (16, 17), for the two-token test, e.g. llama-eval-callback --no-escape --model hunyuan-q8.gguf -n 1 -c 512 -p '12').
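A hedged sketch of the HF side of that comparison (the path and the slicing are illustrative, not the exact script used):

# Feed the same raw token IDs and print the last-position logits,
# to diff against the llama-eval-callback dump.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/tencent_Hunyuan-A13B-Instruct",   # adjust to your local copy
    torch_dtype=torch.bfloat16, device_map="cpu", trust_remote_code=True)

with torch.no_grad():
    input_ids = torch.tensor([[16, 17]])        # the '12' two-token prompt above
    logits = model(input_ids).logits
    print(logits[0, -1])                        # compare against llama.cpp logits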

This is with the combined code from @ngxson and @kooshi, and the .gguf made with @kooshi's code (I took the latest efforts I saw here in the discussion as a starting point).


Below in the dropdown is the transformers test program that produces coherent text for me (up to 100 tokens, because I was too impatient to try longer prompts). I think installing accelerate and asking it to use bfloat16 really helps with memory. I think that would make it run on the M3 512GB machine too; IIRC when I did this for dots.llm1 I really had to use bfloat16 to not run out of memory.

My machine has 256GB of memory, a Hetzner server with a modern AMD EPYC CPU. I do have a Mac Studio (M2, 192GB) as well but for CPU work this Hetzner is usually much faster.

(I don't know why asking it to use bfloat16 helps; maybe it doesn't make giant copies of tensors when you ask for that. It's just something I observed and never checked what it's doing behind the scenes.)

test.py

This is a version of the example code from the Huggingface page that I modified a bit.

#!/usr/bin/env python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re

def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'

        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)

        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
                                                          enable_thinking=True # Toggle thinking mode (default: True)
                                                      )

        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=20)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()
stdout of test.py

The output contains both the token IDs and the decoded text (two print() calls). To run this, you need to install accelerate into your Python environment for the device_map line.

(hunyuan) shannon@soga ~/hunyuan_llama.cpp/hf> ./test.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.09it/s]
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let

I'm on and off this weekend also trying to figure out where exactly the computation graph is off. If I find out before someone else does, I'll let you all know.

(Runs surprisingly fast on transformers+CPU; I'm used to that combo being extraordinarily slow. It is still very slow, just not "it will take 30 minutes to make 10 tokens" slow.)

@jacekpoplawski commented:

Is it possible to load this model in 4-bit precision using Transformers? Does bitsandbytes support this model? I’m limited to a total of 72GB of VRAM across several GPUs, so bfloat16 won’t work for me.

@ubergarm commented Jun 28, 2025

@jacekpoplawski

> Is it possible to load this model in 4-bit precision using Transformers? Does bitsandbytes support this model? I'm limited to a total of 72GB of VRAM across several GPUs, so bfloat16 won't work for me.

Their official inference script for running the int4 quant on vLLM uses --dtype bfloat16.

(still didn't work for me though)

@Noeda (Contributor) commented Jun 28, 2025

To add to @ubergarm's options, I did notice there are some quantized versions like https://huggingface.co/tencent/Hunyuan-A13B-Instruct-FP8 or https://huggingface.co/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4 (at first glance they look like they are designed to work with transformers; I've never in my life run vLLM or sglang even once).

The GPTQ-Int4 one has a single model.safetensors at 43.7GB, which might work. One would hope 😉

Haven't tried any of them. For computation-graph work it feels better to use the highest precision I can run conveniently.

@ngxson (Collaborator, Author) commented Jun 28, 2025

If someone can run it, could you please verify whether attention_mask inside HunYuanDecoderLayer has a non-None value? Thanks.

@Noeda (Contributor) commented Jun 28, 2025

(hunyuan) shannon@soga ~/hunyuan_llama.cpp/hf> ./test2.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.91it/s]
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None

@ngxson is this the part where you wanted to see whether it's None or not? The argument to forward()?

[Screenshot: debug print added inside HunYuanDecoderLayer.forward()]

Edit: took a bigger screenshot to show more clearly where I put that: HunYuanDecoderLayer's forward(). The line numbers you see won't match the original, because I have more print() debugging at the top of the file and other hacky stuff I added.
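For reference, the added debug line is roughly this (a sketch; the real forward() in modeling_hunyuan.py takes more arguments):

# Sketch of the debug print dropped into modeling_hunyuan.py; only the
# placement matters, the rest of the layer body is unchanged.
class HunYuanDecoderLayer(nn.Module):
    def forward(self, hidden_states, attention_mask=None, **kwargs):
        print("Attention mask foobar: ", attention_mask)  # added for this test
        # ... original decoder layer body continues unchanged ...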

Stdout tail, since that first paste is cut off; I see None throughout the entire run. Output looks coherent.

Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let

Edit2: I'm going to let this thing generate a full response, which might take a while. But I feel this might be a bit short as a test; it almost verbatim mentions the prompt in the <think>, so maybe it's about to repeat itself or something. I'll paste it as a new comment when it's done. I just want more confirmation that the HF implementation itself works beyond very short generations.

@Noeda (Contributor) commented Jun 28, 2025

Full response example of the transformers version; I gave it a 5000-token max:

stdout from test2.py (I cut off all the parts that said attention mask is None)
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914,    757,   1212,    555,  89746,
           1148,    358,   1440,     13,   5629,     11,   7106,   2890,   7720,
             25,  96931,    279,   4851,     11,  36050,  35855,     11,   8779,
            449,   4785,   6373,     13,   5112,  10723,   2890,     25,  26338,
           8631,     11,  18547,     11,  18710,     13,  10926,   1101,  67232,
           4907,   5990,     13,   8840,     11,    323,   1317,   9860,   6392,
           1093,  18189,   5326,    315,  21249,  19338,   2345,   8747,  16629,
             11,  63308,     11,   1063,  51423,     13,   8840,     11,    323,
          25702,   7720,     11,   1093,   2731,   5044,    477,   8271,    734,
             13,   6914,    757,  31335,   1521,   3585,    382,   3563,    449,
            459,  17219,    430,   5415,   5912,  10368,    706,  12387,   7720,
             13,   5112,   1464,   1523,   1139,   7106,     11,  10723,     11,
            323,   7344,   1023,  11306,     13,   1789,   7106,     25,   4851,
           2890,    320,   4620,    261,   4851,     11,   4827,   6680,   7410,
            705,   4785,   6373,    320,  22464,     82,  25247,     11,  22890,
          16124,    705,  22852,   1887,    320,  37860,    570,  38895,     25,
            842,  16751,   1354,     11,  26338,   8631,  56592,  16708,     11,
           3698,   1900,  18710,     13,  73235,     25,  57924,   5357,     11,
           5044,     11,   7344,  32174,  25702,  18174,     13,   7429,     11,
           3674,   7720,    422,   1912,  23783,     11,    719,   7344,    430,
            596,  10309,     13,  14998,    311,   2567,    433,  64694,     11,
            779,   7344,    220,     19,     12,     20,   1401,   3585,     13,
          35106,    503,  71921,     13,   7557,   2771,    433,  28555,     13,
           6914,    757,   1817,    422,    358,  13942,   4205,     13,   8840,
             11,   4907,   5990,   2345,  64562,    649,   5376,  61784,     11,
            539,   1120,   8395,  25247,     13,  22335,     11,    430,    596,
           3062,     13,   2100,  63179,    682,   1521,   1139,    264,  56887,
          14646,     13,   6914,    757,  10165,   1473,  31504,  10368,   6209,
            264,   7029,   2134,    315,   7720,    369,   8244,   1664,  33851,
             13,  13101,   2740,     11,    433,  96931,    279,   4851,     11,
          18899,  35855,    323,  46301,    279,   5326,    315,   4787,   1093,
          63308,    323,   4851,   8624,     11,   1418,  86387,    304,   4785,
           6373,   1555,  52703,  20252,    323,  16124,   4857,     13,  49693,
            750,     11,    433,  31854,    279,   4984,    315,    842,  16751,
           1354,     11,  18189,   8631,     11,  18547,     11,    323,  13803,
            315,  18710,     11,    323,  57924,  25702,    734,     11,  56028,
           5357,     11,   5044,     11,    323,  13893,  80430,   4325,  14228,
          10723,  18174,     13,  23212,     11,   5912,   5820,  12231,    988,
           4907,   5990,    555,  18899,  61784,    323,  11815,    264,  16643,
          22852,   1887,     11,  18189,  17563,   5326,     13,  32255,     11,
           1521,   6372,  17210,    311,    264,   5129,     11,  39345,     11,
            323,    810,  24770,   2324,    382,  14524,     11,    374,    430,
           2288,   1317,     30,  10926,  74481,     13,   6914,    757,   1518,
             13,    330,  31504,  10368,   5825,  62387,    582,  25489,   7720,
             13,  13101,   2740,     11,    433,  96931,    279,   4851,     11,
          73115,   6680,   7410,     11,  52797,   4785,   6373,     11,    323,
          67232,  40368,     13,  49693,    750,     11,    433,  19786,    842,
          16751,   1354,     11,  18189,   8631,     11,  18547,     11,    323,
          18710,     11,   1418,  47594,   5357,    323,   5044,     13,   1102,
           1101,  12992,   4907,   5990,    323,   1253,   7781,  25702,  18174,
             13,  28993,     11,    433,  39990,    264,   5129,     11,  39345,
           2324,   1210,   3011,    596,   2731,     13,   4497,  64694,     13,
           4343,    369,  32373,     13,  22335,     11,    430,   4375,     13,
           7557,   2771,    311,   6420,   1401,   5789,   2085,   3794,   2288,
          11944,     13,   3011,   1288,   3504,    433,    627,    524,  27963,
            397,     27,   9399,    397,  31504,  10368,  28421,  28254,   7720,
           4028,   7106,     11,  10723,     11,    323,  25702,  31576,     13,
          13101,   2740,     11,    433,  96931,    279,   4851,     11,  36050,
          35855,     11,    323,  73115,   6680,   7410,     11,  18189,    279,
           5326,    315,   4851,   8624,     11,  63308,     11,    323,  12943,
             13,   1102,  52797,   4785,   6373,    555,  20252,  25247,    323,
           4857,  16025,  16124,     11,   1418,   1101,  47594,  22852,    734,
             13,  49693,    750,     11,  10368,  31854,    842,  16751,    258,
           4984,     11,  46649,  23747,   8631,     11,  18547,     11,    323,
          13803,    315,  18710,     11,    323,  67232,   5357,     11,   5044,
             11,    323,  14604,  56062,     13,   1102,   4726,  12231,    988,
           4907,   5990,    555,  18899,  61784,    323,   1253,   7781,   4325,
          14228,  25702,  18174,     13,  21153,   3210,     11,   1521,   6372,
          12192,    264,   5129,     11,  39345,     11,    323,    810,  24770,
           2324,    627,    524,   9399,     29, 127960]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let me start by recalling what I know. First, physical health benefits: strengthens the heart, improves circulation, helps with weight management. Then mental health: reduces stress, anxiety, depression. Maybe also boosts energy levels. Oh, and long-term stuff like reducing risk of chronic diseases—diabetes, hypertension, some cancers. Oh, and cognitive benefits, like better memory or brain function. Let me organize these points.

Start with an introduction that states regular exercise has numerous benefits. Then break down into physical, mental, and maybe other categories. For physical: heart health (stronger heart, lower blood pressure), weight management (burns calories, builds muscle), immune system (maybe). Mental: endorphins, reduces stress/anxiety, combats depression. Cognitive: enhances focus, memory, maybe delays cognitive decline. Also, social benefits if group exercises, but maybe that's optional. Need to keep it concise, so maybe 4-5 key points. Avoid jargon. Make sure it flows. Let me check if I missed anything. Oh, energy levels—exercise can increase stamina, not just burn calories. Yeah, that's important. So summarize all these into a coherent paragraph. Let me draft:

Regular exercise offers a wide range of benefits for overall well-being. Physically, it strengthens the heart, improving circulation and lowering the risk of conditions like hypertension and heart disease, while aiding in weight management through calorie burning and muscle building. Mentally, it triggers the release of endorphins, reducing stress, anxiety, and symptoms of depression, and enhances cognitive function, boosting focus, memory, and potentially delaying age-related mental decline. Additionally, regular activity elevates energy levels by improving stamina and supports a stronger immune system, reducing illness risk. Together, these effects contribute to a longer, healthier, and more balanced life.

Wait, is that too long? Maybe shorten. Let me see. "Regular exercise provides multifaceted benefits. Physically, it strengthens the heart, lowers blood pressure, aids weight management, and boosts immunity. Mentally, it releases endorphins, reducing stress, anxiety, and depression, while enhancing focus and memory. It also increases energy levels and may delay cognitive decline. Overall, it promotes a longer, healthier life." That's better. More concise. Check for clarity. Yeah, that works. Make sure to mention key areas without getting too detailed. That should cover it.
</think>
<answer>
Regular exercise delivers profound benefits across physical, mental, and cognitive domains. Physically, it strengthens the heart, improves circulation, and lowers blood pressure, reducing the risk of heart disease, hypertension, and stroke. It aids weight management by burning calories and building lean muscle, while also enhancing immune function. Mentally, exercise triggers endorphin release, alleviating stress, anxiety, and symptoms of depression, and boosts focus, memory, and emotional resilience. It further elevates energy levels by improving stamina and may delay age-related cognitive decline. Collectively, these effects promote a longer, healthier, and more balanced life.
</answer><|eos|>

Code is almost the same as before; pasting for reproducibility:

test2.py
#!/usr/bin/env python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re

def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'

        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)

        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
                                                          enable_thinking=True # Toggle thinking mode (default: True)
                                                      )

        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=5000)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()

The output looks normal to me and it answered the prompt. It does look to me like it works.

CPU-only, 256GB Hetzner server.

@ngxson (Collaborator, Author) commented Jun 28, 2025

Ok, so based on the investigations above, it seems like HunYuanSdpaAttention gives the correct result in the PyTorch version. It seems attention_mask is irrelevant for sdpa, so we can ignore it for now.

I'm now getting a 100% logits match between llama.cpp <> sdpa using the random weights. But I'm not sure why the official weights still don't work correctly.

@Noeda (Contributor) commented Jun 28, 2025

I pulled that last commit into the amalgamation I had, and I'm getting <think> tags and some coherence this time. Tokenization is still messed up, so I'll turn my attention to what's going on over there. Maybe the computation graph itself is now fine, but we'll find out.

I don't know if you've tried running the official weights on a Metal GPU, but my experience has been that there have been a whole lot of bugs in the MPS/Metal backend in torch. Things like silu() literally doing nothing at all, tensors silently turning into garbage from the tail end if they are too large, etc. It's gotten better over time, but I still have an instinct to assume it's a Metal bug if something in transformers does not work on the first try. GLM-4, IIRC, just recently had a bug in transformers if you try to run it on Metal at context lengths of 10k+ or so. This is why I try to do these kinds of comparison tests on CPU whenever practical.

@ngxson (Collaborator, Author) commented Jun 28, 2025

> I'm now getting a 100% logits match between llama.cpp <> sdpa using the random weights. But I'm not sure why the official weights still don't work correctly.

Small correction: I only got matching logits for the first 2 tokens. From the 3rd token on, things start to go crazy.

So it's definitely a problem with RoPE.


@Code42Cate commented:

> Full response example of the transformers version; I gave it a 5000-token max:
>
> stdout from test2.py (I cut off all the parts that said attention mask is None)
>
> Code is almost the same as before; pasting for reproducibility:
>
> test2.py
>
> The output looks normal to me and it answered the prompt. It does look to me like it works.
>
> CPU-only, 256GB Hetzner server.

That exact code snippet also worked for me on a B200 GPU (with device_map="auto", of course). Not sure if that helps :)

@kooshi commented Jun 28, 2025

I've been messing with tokenization all day.
The UNK_BYTE is triggered when the tokens contain bytes outside of the very small set available in the GPT-2 decoder, defined here: unicode_utf8_to_byte_map.
The HY tokenizer looks very similar to the original Qwen tokenizer, so I tried using
vocab[QwenModel.token_bytes_to_string(token)] = rank
and
reverse_vocab = {id_ : encoded_tok for encoded_tok, id_ in {**vocab, **special_token_ids}.items()}

in the same way Qwen does, but that obliterated the coherence I had seen before. I'll pull the new cpp code and keep digging.
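For context, QwenModel.token_bytes_to_string in convert_hf_to_gguf.py does roughly the following: each raw byte is remapped through the GPT-2 byte-to-unicode table before being written to the GGUF vocab, which is what llama.cpp later reverses via unicode_utf8_to_byte_map (paraphrased, not a verbatim copy):

# Roughly what QwenModel.token_bytes_to_string does: map every raw byte into
# the printable GPT-2 byte alphabet so it can be stored as a vocab string.
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

def token_bytes_to_string(b: bytes) -> str:
    byte_encoder = bytes_to_unicode()
    return ''.join(byte_encoder[ord(ch)] for ch in b.decode('latin-1'))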

@ngxson (Collaborator, Author) commented Jun 28, 2025

RoPE is fixed. However, a new problem appears:

It seems like some engineers at Tencent thought they should make their top-k MoE selection a bit "special".

And by "special", I mean this block of code, which seems harmless at first. In short, what it does is keep track of the usage of each expert. If an expert is used too much (i.e. exceeds its capacity), it will be "de-prioritized". I assume this is meant to address the problem that the MoE router is extremely hard to train (ref: Qwen3MoE has some "unused" experts).

Sounds like a good idea, but this is extremely difficult to reimplement in llama.cpp.

This also makes the number of experts used per token uneven: some tokens will use fewer experts than others, and some use no experts at all (due to the prioritization explained above). That sounds good on the surface, but the actual implementation always calculates a fixed number of experts per token, which defeats the whole point. I'm now confident that Tencent messed up this time.
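To illustrate the routing behaviour described above, here is a hedged paraphrase (not Tencent's actual code; all names are made up):

# Illustrative capacity-limited top-k routing: expert usage is tracked across
# tokens and "full" experts are dropped, so some tokens end up with fewer than
# top_k experts -- which a fixed experts-per-token ggml graph can't express.
import torch

def capacity_limited_topk(router_logits: torch.Tensor, top_k: int, capacity: int):
    n_tokens, n_experts = router_logits.shape
    usage = torch.zeros(n_experts, dtype=torch.long)
    selected = []
    for t in range(n_tokens):              # sequential over tokens: earlier tokens
        scores = router_logits[t].clone()  # can exhaust an expert's capacity
        picks = []
        for _ in range(top_k):
            e = int(torch.argmax(scores))
            if usage[e] < capacity:
                picks.append(e)
                usage[e] += 1
            scores[e] = float("-inf")      # never consider this expert again
        selected.append(picks)             # may contain fewer than top_k experts
    return selected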

@kooshi commented Jun 29, 2025

Woo! Tokenization is working, and I get a whopping 75 tps in q4_0 on my GPUs.
It still needs a couple of tweaks for some warnings, but the code is here: https://github.com/kooshi/llama.cpp/tree/hunyuan

> Hello
<think>
Okay, the user said "Hello". That's a simple greeting. I should respond in a friendly and welcoming manner.

Maybe start with a greeting like "Hello!" or "Hi there!". Then, offer assistance by asking how I can help. Keep the tone positive and approachable.

Let me check if there's anything else needed. The user just greeted, so a simple response should suffice. Make sure to avoid any markdown formatting and keep the language natural.
</think>
<answer>
Hello there!
</answer>

Edit: It still devolves into repetition pretty quickly though 😔

Labels: python (python script changes)
Projects: none yet
Successfully merging this pull request may close these issues: Feature Request: Hunyuan-A13B model support
9 participants