System Info
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000001:00:00.0 Off | 0 |
| N/A 36C P0 63W / 300W | 72269MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 9984 C /opt/conda/bin/python3.11 72256MiB |
+-----------------------------------------------------------------------------------------+
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
Run this Docker command:
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
-e USE_FLASH_ATTENTION=true \
ghcr.io/huggingface/text-generation-inference:3.0.0 \
--model-id Qwen/Qwen2-VL-2B-Instruct
Expected behavior
This fails for v3.0.0 but works for v2.4.1. Output for v3.0.0:
2024-12-11T10:03:46.972045Z INFO text_generation_launcher: Args {
model_id: "Qwen/Qwen2-VL-2B-Instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(
32768,
),
max_input_length: None,
max_total_tokens: Some(
128000,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
32768,
),
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "30f25418466f",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2024-12-11T10:03:46.972098Z INFO hf_hub: Token file not found "/data/token"
2024-12-11T10:03:48.240902Z INFO text_generation_launcher: Disabling prefix caching because of VLM model
2024-12-11T10:03:48.240923Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching 0
2024-12-11T10:03:48.240947Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-12-11T10:03:48.241043Z INFO download: text_generation_launcher: Starting check and download process for Qwen/Qwen2-VL-2B-Instruct
2024-12-11T10:03:50.873983Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-12-11T10:03:51.356039Z INFO download: text_generation_launcher: Successfully downloaded weights for Qwen/Qwen2-VL-2B-Instruct
2024-12-11T10:03:51.356269Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-11T10:03:53.865482Z INFO text_generation_launcher: Using prefix caching = False
2024-12-11T10:03:53.865517Z INFO text_generation_launcher: Using Attention = flashinfer
2024-12-11T10:04:00.459738Z INFO text_generation_launcher: Using prefill chunking = False
2024-12-11T10:04:00.597656Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-11T10:04:00.680759Z INFO shard-manager: text_generation_launcher: Shard ready in 9.313982844s rank=0
2024-12-11T10:04:00.768525Z INFO text_generation_launcher: Starting Webserver
2024-12-11T10:04:00.802105Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-11T10:04:00.921993Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-11T10:04:03.592966Z INFO text_generation_launcher: KV-cache blocks: 2484579, size: 1
2024-12-11T10:04:03.614039Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-12-11T10:04:03.877415Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
return callback(**use_params)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1688, in warmup
self.cuda_graph_warmup(bs, max_total_tokens, max_total_tokens)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1495, in cuda_graph_warmup
self.model.forward(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/qwen2_vl.py", line 536, in forward
hidden_states = self.text_model(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 336, in fo
rward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in fo
rward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/rotary.py", line 57, in forward
rotary_emb.apply_rotary(q1, q2, cos, sin, q1, q2, False)
RuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0
2024-12-11T10:04:03.877681Z ERROR warmup{max_input_length=Some(32768) max_prefill_tokens=32768 max_total_tokens=Some(128000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0
Error: Backend(Warmup(Generation("The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0")))
2024-12-11T10:04:03.895576Z ERROR text_generation_launcher: Webserver Crashed
2024-12-11T10:04:03.895597Z INFO text_generation_launcher: Shutting down shards
2024-12-11T10:04:03.985108Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-12-11T10:04:03.985343Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-12-11T10:04:04.285705Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
This bug seems to be caused by the default value for --cuda-graphs. I get the same error when running the text-generation-launcher command directly. A workaround is to disable CUDA graphs, e.g. with the command shown below.
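A minimal sketch of that workaround, assuming the same Docker setup as in the reproduction above: the launcher's --cuda-graphs option (or the CUDA_GRAPHS environment variable) overrides the default graph sizes, and passing 0 should disable CUDA graph capture entirely:

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    -e USE_FLASH_ATTENTION=true \
    ghcr.io/huggingface/text-generation-inference:3.0.0 \
    --model-id Qwen/Qwen2-VL-2B-Instruct \
    --cuda-graphs 0

With graphs disabled, warmup no longer goes through cuda_graph_warmup, which is where the rotary-embedding size mismatch is raised; the trade-off is losing the decode-time speedup that CUDA graphs normally provide.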