Eval bug: Out of Memory Error with Qwen2-VL on Windows #10973

Open
AmineM24 opened this issue Dec 25, 2024 · 4 comments

@AmineM24

Name and Version

version: 4391 (9ba399d)
built with MSVC 19.29.30157.0 for

Operating systems

Windows

GGML backends

CPU, CUDA

Hardware

CPU: Intel Core i7-13850HX
GPU: NVIDIA RTX 3500 Ada
RAM: 32 GB

Models

Qwen2-VL-7B-instruct-Q4_K_M.gguf
bartowski/Qwen2-VL-7B-Instruct-GGUF

Problem description & steps to reproduce

I tried running the following command on Windows, using both the AVX2 and CUDA binaries downloaded from the releases page:

llama-qwen2vl-cli -m Qwen2-VL-7B-Instruct-Q4_K_M.gguf --mmproj mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p "Describe this image" --image "facture_1.png"

This is the output for CUDA:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 937664.01 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 983211975680
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-backend.cpp:262: GGML_ASSERT(buf != NULL && "tensor buffer not set") failed

The same thing happened with the CPU version.
I tried different options, such as lowering the batch size (-b) or the context size (-c), but it still crashes with the same error.
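For reference, a representative invocation with those flags (the values here are illustrative, not the exact ones I tried):

llama-qwen2vl-cli -m Qwen2-VL-7B-Instruct-Q4_K_M.gguf --mmproj mmproj-Qwen2-VL-7B-Instruct-f32.gguf -c 2048 -b 512 -p "Describe this image" --image "facture_1.png"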

First Bad Commit

No response

Relevant log output

C:\Users\ouqas\cuda-llama>llama-qwen2vl-cli -m Qwen2-VL-7B-Instruct-Q4_K_M.gguf --mmproj mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p "Describe this image" --image "facture_1.png"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 3500 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
build: 4337 (160bc039) with MSVC 19.29.30157.0 for
llama_load_model_from_file: using device CUDA0 (NVIDIA RTX 3500 Ada Generation Laptop GPU) - 11117 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 339 tensors from Qwen2-VL-7B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 VL 7B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-7B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                        qwen2vl.block_count u32              = 28
llama_model_loader: - kv  14:                     qwen2vl.context_length u32              = 32768
llama_model_loader: - kv  15:                   qwen2vl.embedding_length u32              = 3584
llama_model_loader: - kv  16:                qwen2vl.feed_forward_length u32              = 18944
llama_model_loader: - kv  17:               qwen2vl.attention.head_count u32              = 28
llama_model_loader: - kv  18:            qwen2vl.attention.head_count_kv u32              = 4
llama_model_loader: - kv  19:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                          general.file_type u32              = 15
llama_model_loader: - kv  22:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-7B-Instruct-GGUF...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2vl
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 8
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 4.36 GiB (4.91 BPW)
llm_load_print_meta: general.name     = Qwen2 VL 7B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  4460.45 MiB
........clip_model_load: model name:   Qwen2-VL-7B-Instruct
clip_model_load: description:  image encoder for Qwen2VL
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    521
clip_model_load: n_kv:         20
clip_model_load: ftype:        f32

........clip_model_load: loaded meta data with 20 key-value pairs and 521 tensors from mmproj-Qwen2-VL-7B-Instruct-f32.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                        general.description str              = image encoder for Qwen2VL
clip_model_load: - kv   2:                          general.file_type u32              = 0
clip_model_load: - kv   3:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   4:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   5:                    clip.has_qwen2vl_merger bool             = true
clip_model_load: - kv   6:                        clip.projector_type str              = qwen2vl_merger
clip_model_load: - kv   7:                              clip.use_silu bool             = false
........................clip_model_load: - kv   8:                              clip.use_gelu bool             = false
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:                     clip.vision.image_size u32              = 560
clip_model_load: - kv  11:               clip.vision.embedding_length u32              = 1280
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 3584
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 32
clip_model_load: - kv  16:            clip.vision.feed_forward_length u32              = 0
clip_model_load: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct
clip_model_load: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - type  f32:  521 tensors
............................................
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  0
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     2577.82 MB
clip_model_load: metadata size:  0.18 MB
clip_model_load: params backend buffer size =  2577.82 MB (521 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 198.93 MB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   744.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 396 (with bs=512), 1 (with bs=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 937664.01 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 983211975680
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-backend.cpp:262: GGML_ASSERT(buf != NULL && "tensor buffer not set") failed
@l29ah
Contributor

l29ah commented Dec 26, 2024

What are the dimensions of the image you're using?

@chaoszzz123

I ran into the same error.

@AmineM24
Author

What are the dimensions of the image you're using?

It seems like the issue was indeed related to the image resolution.
The first image I tested had dimensions of 4132 x 5858 pixels, which triggered the error.
When I tested with an image of 1131 x 1600 pixels, it worked (although it was slow, taking almost 3 minutes to generate a response).
For reference, I was using the CPU backend.
Is an increased image resolution really that problematic? Does it consume so much RAM that it exceeds even the model's own memory requirements?
Thanks in advance
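
One possible workaround, sketched below on the assumption that downscaling the input on the host side is acceptable and that Pillow is installed: resize the image so its longest side stays around the size that worked here (1600 px) before passing it to llama-qwen2vl-cli.

from PIL import Image

def downscale(path: str, out_path: str, max_side: int = 1600) -> None:
    # Shrink the image so its longest side is at most max_side pixels,
    # keeping the aspect ratio; images already small enough are saved unchanged.
    img = Image.open(path)
    scale = max_side / max(img.size)
    if scale < 1.0:
        img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    img.save(out_path)

downscale("facture_1.png", "facture_1_small.png")  # then pass facture_1_small.png via --image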

@chaoszzz123

It is indeed an image resolution problem: the vision part of the Qwen2-VL model can handle images of arbitrary resolution, which means larger images produce more tokens to be projected into the LLM part. According to the paper, an image of 4132 x 5858 pixels corresponds to (4132 x 5858) / (14 x 14) / 4 ≈ 30,874 tokens, which demands a very large amount of RAM and leads to very slow inference.
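
For concreteness, that estimate can be reproduced with a few lines of Python (14 is the vision patch size and 4 the 2x2 patch-merge factor cited above; the real preprocessor rounds the image dimensions, so treat the numbers as rough):

def qwen2vl_visual_tokens(width: int, height: int, patch: int = 14, merge: int = 4) -> int:
    # Rough estimate: pixel count / patch area / patch-merge factor.
    return (width * height) // (patch * patch) // merge

print(qwen2vl_visual_tokens(4132, 5858))  # ~30,874 tokens -- the image that ran out of memory
print(qwen2vl_visual_tokens(1131, 1600))  # ~2,308 tokens  -- the image that worked, slowly, on CPU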
