Some models like gemma-3n crash - rocBLAS error: Cannot read /opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1036 #14421

Open
@grigio

Description

Some models work, some models crash. This crashes:

llama-cli -m /models/gemma-3n-E4B-it-Q4_K_M.gguf --jinja --single-turn -sys "You are a helpful assistant" -p "Hello"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1036 (0x1036), VMM: no, Wave Size: 32
build: 1 (f667f1e) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 23642 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 847 tensors from /models/gemma-3n-E4B-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3n
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma-3N-E4B-It
llama_model_loader: - kv   3:                           general.finetune str              = 3n-E4B-it
llama_model_loader: - kv   4:                           general.basename str              = Gemma-3N-E4B-It
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 6.9B
llama_model_loader: - kv   7:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   8:                     gemma3n.context_length u32              = 32768
llama_model_loader: - kv   9:                   gemma3n.embedding_length u32              = 2048
llama_model_loader: - kv  10:                        gemma3n.block_count u32              = 35
llama_model_loader: - kv  11:                gemma3n.feed_forward_length u32              = 16384
llama_model_loader: - kv  12:               gemma3n.attention.head_count u32              = 8
llama_model_loader: - kv  13:   gemma3n.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:               gemma3n.attention.key_length u32              = 256
llama_model_loader: - kv  15:             gemma3n.attention.value_length u32              = 256
llama_model_loader: - kv  16:                     gemma3n.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  17:           gemma3n.attention.sliding_window u32              = 512
llama_model_loader: - kv  18:            gemma3n.attention.head_count_kv u32              = 2
llama_model_loader: - kv  19:                   gemma3n.altup.active_idx u32              = 0
llama_model_loader: - kv  20:                   gemma3n.altup.num_inputs u32              = 4
llama_model_loader: - kv  21:   gemma3n.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  22:         gemma3n.attention.shared_kv_layers f32              = 15.000000
llama_model_loader: - kv  23:          gemma3n.activation_sparsity_scale arr[f32,35]      = [1.644854, 1.644854, 1.644854, 1.6448...
llama_model_loader: - kv  24:   gemma3n.attention.sliding_window_pattern arr[bool,35]     = [true, true, true, true, false, true,...
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  29:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  36:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  37:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  38:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  422 tensors
llama_model_loader: - type q4_K:  282 tensors
llama_model_loader: - type q6_K:   35 tensors
llama_model_loader: - type bf16:  108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 3.94 GiB (4.93 BPW) 
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3n
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 35
print_info: n_head           = 8
print_info: n_head_kv        = 2
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 1.0e+00
print_info: n_ff             = 16384
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = E4B
print_info: model params     = 6.87 B
print_info: general.name     = Gemma-3N-E4B-It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 106 '<end_of_turn>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/36 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  4034.52 MiB
........................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified:        CPU KV buffer size =    32.00 MiB
llama_kv_cache_unified: size =   32.00 MiB (  4096 cells,   4 layers,  1 seqs), K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 1024 cells
llama_kv_cache_unified:        CPU KV buffer size =    32.00 MiB
llama_kv_cache_unified: size =   32.00 MiB (  1024 cells,  16 layers,  1 seqs), K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_context:      ROCm0 compute buffer size =   940.00 MiB
llama_context:  ROCm_Host compute buffer size =    31.51 MiB
llama_context: graph nodes  = 3266
llama_context: graph splits = 843 (with bs=512), 6 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model


system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

sampler seed: 3368873966
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

user
You are a helpful assistant

Hello
model

rocBLAS error: Cannot read /opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1036
 List of available TensileLibrary Files : 
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
Aborted (core dumped)
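
The error above names the detected GPU arch and then lists the TensileLibrary files rocBLAS found on disk. As a rough diagnostic sketch (not part of llama.cpp or ROCm; the helper name `missing_tensile_arch` is hypothetical and the sample is a shortened copy of the log above), the two can be compared like this to confirm that no file for gfx1036 is installed:

```python
import re

def missing_tensile_arch(log_text):
    """Parse a rocBLAS 'Cannot read ... TensileLibrary.dat' error.

    Returns (arch, available): the GPU arch named in the error and the
    sorted list of archs that have a TensileLibrary_lazy_<arch>.dat file.
    """
    arch_match = re.search(r"for GPU arch\s*:\s*(gfx\w+)", log_text)
    arch = arch_match.group(1) if arch_match else None
    available = sorted(set(re.findall(r"TensileLibrary_lazy_(gfx\w+)\.dat", log_text)))
    return arch, available

# Shortened excerpt of the error output from this report.
sample = '''rocBLAS error: Cannot read /opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1036
 List of available TensileLibrary Files :
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
'''

arch, available = missing_tensile_arch(sample)
print(arch, arch in available)  # gfx1036 False
```

Running it against the full log gives the same answer: gfx1036 is not among the listed archs, which matches the point where rocBLAS aborts.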

This works

llama-cli -m /models/gemma-3-12b-it-Q4_K_M.gguf --jinja --single-turn -sys "You are a helpful assistant" -p "Hello"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1036 (0x1036), VMM: no, Wave Size: 32
build: 1 (f667f1e) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
...
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

sampler seed: 115469656
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

user
You are a helpful assistant

Hello
model
Hello! It's nice to hear from you. 😊 

How can I help you today? Let me know what you need! [end of text]


llama_perf_sampler_print:    sampling time =       3.45 ms /    45 runs   (    0.08 ms per token, 13039.70 tokens per second)
llama_perf_context_print:        load time =    1087.73 ms
llama_perf_context_print: prompt eval time =     582.84 ms /    16 tokens (   36.43 ms per token,    27.45 tokens per second)
llama_perf_context_print:        eval time =    5768.43 ms /    28 runs   (  206.02 ms per token,     4.85 tokens per second)
llama_perf_context_print:       total time =    6369.64 ms /    44 tokens

llama-cli --list-devices   
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1036 (0x1036), VMM: no, Wave Size: 32
Available devices:
  ROCm0: AMD Radeon Graphics (23675 MiB, 23642 MiB free)

llama-cli --version     
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1036 (0x1036), VMM: no, Wave Size: 32
version: 1 (f667f1e)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
