Error for Qwen2-VL-2B-Instruct using v3.0.0 #2823

Open

tobiasvanderwerff opened this issue Dec 11, 2024 · 1 comment

tobiasvanderwerff commented Dec 11, 2024

System Info

  • Ubuntu 22.04.5
  • Python 3.11
  • NVIDIA A100
  • model: Qwen/Qwen2-VL-2B-Instruct
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000001:00:00.0 Off |                    0 |
| N/A   36C    P0             63W /  300W |   72269MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      9984      C   /opt/conda/bin/python3.11                   72256MiB |
+-----------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run this Docker command:

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    -e USE_FLASH_ATTENTION=true \
    ghcr.io/huggingface/text-generation-inference:3.0.0 \
    --model-id Qwen/Qwen2-VL-2B-Instruct

Expected behavior

This fails for v3.0.0 but works for v2.4.1. Output for v3.0.0:

2024-12-11T10:03:46.972045Z  INFO text_generation_launcher: Args {                                                                     
    model_id: "Qwen/Qwen2-VL-2B-Instruct",                                                                                             
    revision: None,                                                                                                                    
    validation_workers: 2,                                                                                                             
    sharded: None,                                                                                                                     
    num_shard: None,                                                                                                                   
    quantize: None,                                                                                                                    
    speculate: None,                                                                                                                   
    dtype: None,                                                                                                                       
    kv_cache_dtype: None,                                                                                                              
    trust_remote_code: false,                                                                                                          
    max_concurrent_requests: 128,                                                                                                      
    max_best_of: 2,                                                                                                                    
    max_stop_sequences: 4,                                                                                                             
    max_top_n_tokens: 5,                                                                                                               
    max_input_tokens: Some(                                                                                                            
        32768,                                                                                                                         
    ),                                                                                                                                 
    max_input_length: None,                                                                                                            
    max_total_tokens: Some(                                                                                                            
        128000,                                                                                                                        
    ),                                                                                                                                 
    waiting_served_ratio: 0.3,                                                                                                         
    max_batch_prefill_tokens: Some(                                                                                                    
        32768,                                                                                                                         
    ),                                                                                                                                 
    max_batch_total_tokens: None,                                                                                                      
    max_waiting_tokens: 20,                                                                                                            
    max_batch_size: None,                                                                                                              
    cuda_graphs: None,                                                                                                                 
    hostname: "30f25418466f",                                                                                                          
    port: 80,                                                                                                                          
    shard_uds_path: "/tmp/text-generation-server",                                                                                     
    master_addr: "localhost",                                                                                                          
    master_port: 29500,                                                                                                                
    huggingface_hub_cache: None,                                                                                                       
    weights_cache_override: None,                                                                                                      
    disable_custom_kernels: false,                                                                                                     
    cuda_memory_fraction: 1.0,                                                                                                         
    rope_scaling: None,                                                                                                                
    rope_factor: None,                                                                                                                 
    json_output: false,                                                                                                                
    otlp_endpoint: None,                                                                                                               
    otlp_service_name: "text-generation-inference.router",                                                                             
    cors_allow_origin: [],                                                                                                             
    api_key: None,                                                                                                                     
    watermark_gamma: None,                                                                                                             
    watermark_delta: None,                                                                                                             
    ngrok: false,                                                                                                                      
    ngrok_authtoken: None,                                                                                                             
    ngrok_edge: None,                                                                                                                  
    tokenizer_config_path: None,                                                                                                       
    disable_grammar_support: false,                                                                                                    
    env: false,                                                                                                                        
    max_client_batch_size: 4,                                                                                                          
    lora_adapters: None,                                                                                                               
    usage_stats: On,                                                                                                                   
    payload_limit: 2000000,                                                                                                            
    enable_prefill_logprobs: false,                                                                                                    
}                                                                                                                                      
2024-12-11T10:03:46.972098Z  INFO hf_hub: Token file not found "/data/token"                                                           
2024-12-11T10:03:48.240902Z  INFO text_generation_launcher: Disabling prefix caching because of VLM model                              
2024-12-11T10:03:48.240923Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching 0                              
2024-12-11T10:03:48.240947Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]                             
2024-12-11T10:03:48.241043Z  INFO download: text_generation_launcher: Starting check and download process for Qwen/Qwen2-VL-2B-Instruct
2024-12-11T10:03:50.873983Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.                  
2024-12-11T10:03:51.356039Z  INFO download: text_generation_launcher: Successfully downloaded weights for Qwen/Qwen2-VL-2B-Instruct
2024-12-11T10:03:51.356269Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-11T10:03:53.865482Z  INFO text_generation_launcher: Using prefix caching = False
2024-12-11T10:03:53.865517Z  INFO text_generation_launcher: Using Attention = flashinfer
2024-12-11T10:04:00.459738Z  INFO text_generation_launcher: Using prefill chunking = False
2024-12-11T10:04:00.597656Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-11T10:04:00.680759Z  INFO shard-manager: text_generation_launcher: Shard ready in 9.313982844s rank=0
2024-12-11T10:04:00.768525Z  INFO text_generation_launcher: Starting Webserver
2024-12-11T10:04:00.802105Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-11T10:04:00.921993Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-11T10:04:03.592966Z  INFO text_generation_launcher: KV-cache blocks: 2484579, size: 1
2024-12-11T10:04:03.614039Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-12-11T10:04:03.877415Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module> 
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
    return callback(**use_params) 
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 144, in Warmup
    self.model.warmup(batch, max_input_tokens, max_total_tokens)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1688, in warmup
    self.cuda_graph_warmup(bs, max_total_tokens, max_total_tokens)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1495, in cuda_graph_warmup
    self.model.forward(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/qwen2_vl.py", line 536, in forward
    hidden_states = self.text_model(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 336, in fo
rward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in fo
rward
    attn_output = self.self_attn( 
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/rotary.py", line 57, in forward
    rotary_emb.apply_rotary(q1, q2, cos, sin, q1, q2, False)
RuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0
2024-12-11T10:04:03.877681Z ERROR warmup{max_input_length=Some(32768) max_prefill_tokens=32768 max_total_tokens=Some(128000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0
Error: Backend(Warmup(Generation("The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0")))
2024-12-11T10:04:03.895576Z ERROR text_generation_launcher: Webserver Crashed
2024-12-11T10:04:03.895597Z  INFO text_generation_launcher: Shutting down shards
2024-12-11T10:04:03.985108Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-12-11T10:04:03.985343Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-12-11T10:04:04.285705Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
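
For reference, the identical command pinned to the 2.4.1 image starts up without this error (a minimal sketch of the working invocation, assuming only the image tag changes):

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    -e USE_FLASH_ATTENTION=true \
    ghcr.io/huggingface/text-generation-inference:2.4.1 \
    --model-id Qwen/Qwen2-VL-2B-Instruct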
janne-alatalo (Contributor) commented

This bug seems to be caused by the default value for --cuda-graphs. I get the same error when running the text-generation-launcher command directly. The workaround is to disable CUDA graphs, e.g.

text-generation-launcher --model-id Qwen2-VL-7B-Instruct --cuda-graphs 0
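
For the Docker setup from the reproduction above, the same workaround can presumably be applied by appending the flag after the model id, since everything after the image name is forwarded to text-generation-launcher (untested sketch based on the reporter's command):

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    -e USE_FLASH_ATTENTION=true \
    ghcr.io/huggingface/text-generation-inference:3.0.0 \
    --model-id Qwen/Qwen2-VL-2B-Instruct \
    --cuda-graphs 0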
