System Info
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000001:00:00.0 Off | 0 |
| N/A 36C P0 63W / 300W | 72269MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 9984 C /opt/conda/bin/python3.11 72256MiB |
+-----------------------------------------------------------------------------------------+
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
Run this Docker command:
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
-e USE_FLASH_ATTENTION=true \
ghcr.io/huggingface/text-generation-inference:3.0.0 \
--model-id Qwen/Qwen2-VL-2B-Instruct
Expected behavior
This fails for v3.0.0 but works for v2.4.1. Output for v3.0.0:
2024-12-11T10:03:46.972045Z INFO text_generation_launcher: Args {
model_id: "Qwen/Qwen2-VL-2B-Instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(
32768,
),
max_input_length: None,
max_total_tokens: Some(
128000,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
32768,
),
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "30f25418466f",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2024-12-11T10:03:46.972098Z INFO hf_hub: Token file not found "/data/token"
2024-12-11T10:03:48.240902Z INFO text_generation_launcher: Disabling prefix caching because of VLM model
2024-12-11T10:03:48.240923Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching 0
2024-12-11T10:03:48.240947Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-12-11T10:03:48.241043Z INFO download: text_generation_launcher: Starting check and download process for Qwen/Qwen2-VL-2B-Instruct
2024-12-11T10:03:50.873983Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-12-11T10:03:51.356039Z INFO download: text_generation_launcher: Successfully downloaded weights for Qwen/Qwen2-VL-2B-Instruct
2024-12-11T10:03:51.356269Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-11T10:03:53.865482Z INFO text_generation_launcher: Using prefix caching = False
2024-12-11T10:03:53.865517Z INFO text_generation_launcher: Using Attention = flashinfer
2024-12-11T10:04:00.459738Z INFO text_generation_launcher: Using prefill chunking = False
2024-12-11T10:04:00.597656Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-11T10:04:00.680759Z INFO shard-manager: text_generation_launcher: Shard ready in 9.313982844s rank=0
2024-12-11T10:04:00.768525Z INFO text_generation_launcher: Starting Webserver
2024-12-11T10:04:00.802105Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-11T10:04:00.921993Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-11T10:04:03.592966Z INFO text_generation_launcher: KV-cache blocks: 2484579, size: 1
2024-12-11T10:04:03.614039Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-12-11T10:04:03.877415Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
return callback(**use_params)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1688, in warmup
self.cuda_graph_warmup(bs, max_total_tokens, max_total_tokens)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1495, in cuda_graph_warmup
self.model.forward(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/qwen2_vl.py", line 536, in forward
hidden_states = self.text_model(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 336, in fo
rward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in fo
rward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/rotary.py", line 57, in forward
rotary_emb.apply_rotary(q1, q2, cos, sin, q1, q2, False)
RuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0
2024-12-11T10:04:03.877681Z ERROR warmup{max_input_length=Some(32768) max_prefill_tokens=32768 max_total_tokens=Some(128000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0
Error: Backend(Warmup(Generation("The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0")))
2024-12-11T10:04:03.895576Z ERROR text_generation_launcher: Webserver Crashed
2024-12-11T10:04:03.895597Z INFO text_generation_launcher: Shutting down shards
2024-12-11T10:04:03.985108Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-12-11T10:04:03.985343Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-12-11T10:04:04.285705Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
This bug seems to be caused by the default value for --cuda-graphs. I get the same error when running the text-generation-launcher command directly. A workaround is to disable CUDA graphs, e.g. with the command shown below.
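A minimal sketch of that workaround, assuming the same Docker setup as in the reproduction above: the launcher's --cuda-graphs option (or the CUDA_GRAPHS environment variable) overrides the default graph sizes, and passing 0 should disable CUDA graph capture entirely:

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    -e USE_FLASH_ATTENTION=true \
    ghcr.io/huggingface/text-generation-inference:3.0.0 \
    --model-id Qwen/Qwen2-VL-2B-Instruct \
    --cuda-graphs 0

With graphs disabled, warmup no longer goes through cuda_graph_warmup, which is where the rotary-embedding size mismatch is raised; the trade-off is losing the decode-time speedup that CUDA graphs normally provide.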