
Unkown compute for card nvidia-a100-80gb-pcie #2822

Open

ferreroal opened this issue Dec 11, 2024 · 4 comments · May be fixed by #2837

Comments

@ferreroal

System Info

When trying the new TGI v3.0.0 on an AzureML Standard_NC24ads_A100_v4 compute (https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nca100v4-series?tabs=sizebasic), TGI is not able to properly detect the GPU.

The reported name doesn't match the ones listed for A100 cards in

"nvidia-a100-sxm4-80gb" => Gpu::A100,

Trying to deploy Qwen2.5-Coder-7B in full precision, I got an error during the Prefill method. TGI v2.4.0 works without issues; v2.4.1 also fails (I think it already has some auto-detected values).

Complete stack trace without setting any values in the config/envs:

Instance status:
SystemSetup: Succeeded
UserContainerImagePull: Succeeded
ModelDownload: Succeeded
UserContainerStart: InProgress

Container events:
Kind: Pod, Name: Downloading, Type: Normal, Time: 2024-12-10T16:29:38.664423Z, Message: Start downloading models
Kind: Pod, Name: Pulling, Type: Normal, Time: 2024-12-10T16:29:39.436264Z, Message: Start pulling container image
Kind: Pod, Name: Pulled, Type: Normal, Time: 2024-12-10T16:33:13.545118Z, Message: Container image is pulled successfully
Kind: Pod, Name: Downloaded, Type: Normal, Time: 2024-12-10T16:33:13.545118Z, Message: Models are downloaded successfully
Kind: Pod, Name: Created, Type: Normal, Time: 2024-12-10T16:33:13.570232Z, Message: Created container inference-server
Kind: Pod, Name: Started, Type: Normal, Time: 2024-12-10T16:33:13.772813Z, Message: Started container inference-server
Kind: Pod, Name: ReadinessProbeFailed, Type: Warning, Time: 2024-12-10T16:34:22.215502Z, Message: Readiness probe failed: HTTP probe failed with statuscode: 503

Container logs:
2024-12-10T16:33:13.791677Z  INFO text_generation_launcher: Args {
model_id: "/var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: true,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "mir-user-pod-2233f29894a94ca084027bb52cf77929000000",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2024-12-10T16:33:15.143210Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-12-10T16:33:15.180035Z  WARN text_generation_launcher: Unkown compute for card nvidia-a100-80gb-pcie
2024-12-10T16:33:15.217613Z  INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2024-12-10T16:33:15.217633Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-12-10T16:33:15.217638Z  WARN text_generation_launcher: trust_remote_code is set. Trusting that model /var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir do not contain malicious code.
2024-12-10T16:33:15.217908Z  INFO download: text_generation_launcher: Starting check and download process for /var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir
2024-12-10T16:33:17.736806Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-12-10T16:33:18.129901Z  INFO download: text_generation_launcher: Successfully downloaded weights for /var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir
2024-12-10T16:33:18.130146Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-10T16:33:20.664399Z  INFO text_generation_launcher: Using prefix caching = True
2024-12-10T16:33:20.664453Z  INFO text_generation_launcher: Using Attention = flashinfer
2024-12-10T16:33:28.150232Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-12-10T16:33:31.029573Z  INFO text_generation_launcher: Using prefill chunking = True
2024-12-10T16:33:31.597225Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-10T16:33:31.655611Z  INFO shard-manager: text_generation_launcher: Shard ready in 13.518157709s rank=0
2024-12-10T16:33:31.743685Z  INFO text_generation_launcher: Starting Webserver
2024-12-10T16:33:31.815698Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-10T16:33:31.836887Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-10T16:33:33.621609Z  INFO text_generation_launcher: KV-cache blocks: 1110376, size: 1
2024-12-10T16:33:33.634080Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-12-10T16:33:34.610076Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 1110376
2024-12-10T16:34:34.610095Z  WARN text_generation_router_v3::backend: backends/v3/src/backend.rs:39: Model supports prefill chunking. waiting_served_ratio and max_waiting_tokens will be ignored.
2024-12-10T16:33:34.610112Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2024-12-10T16:33:34.610116Z  INFO text_generation_router: backends/v3/src/main.rs:162: Maximum input tokens defaulted to 32767
2024-12-10T16:33:34.610119Z  INFO text_generation_router: backends/v3/src/main.rs:168: Maximum total tokens defaulted to 32768
2024-12-10T16:33:36.853347Z  INFO text_generation_router::server: router/src/server.rs:1873: Using config Some(Qwen2)
2024-12-10T16:33:36.979425Z  WARN text_generation_router::server: router/src/server.rs:1913: no pipeline tag found for model /var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir
2024-12-10T16:33:36.979442Z  WARN text_generation_router::server: router/src/server.rs:2015: Invalid hostname, defaulting to 0.0.0.0
2024-12-10T16:33:37.033972Z  INFO text_generation_router::server: router/src/server.rs:2402: Connected
2024-12-10T16:34:22.215030Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
return callback(**use_params)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(

File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 183, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1953, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1848, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 409, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 336, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 158, in forward
attn_output = attention(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/attention/cuda.py", line 233, in attention
return prefill_with_paged_kv_state.get().forward(
File "/opt/conda/lib/python3.11/site-packages/flashinfer/prefill.py", line 879, in forward
return self.run(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)
File "/opt/conda/lib/python3.11/site-packages/flashinfer/prefill.py", line 939, in run
out = self._wrapper.run(
RuntimeError: BatchPrefillWithPagedKVCache failed with error code an illegal memory access was encountered
2024-12-10T16:34:22.215241Z ERROR health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: BatchPrefillWithPagedKVCache failed with error code an illegal memory access was encountered
2024-12-10T16:34:23.124632Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2024-12-10 16:33:19.507 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
return func(*args, **kwargs)
CUDA Error: an illegal memory access was encountered (700) /tmp/build-via-sdist-fmqwe4he/flashinfer-0.1.6+cu124torch2.4/include/flashinfer/attention/prefill.cuh: line 2370 at function cudaLaunchKernel((void*)kernel, nblks, nthrs, args, smem_size, stream) rank=0
2024-12-10T16:34:23.167755Z ERROR text_generation_launcher: Shard 0 crashed
2024-12-10T16:34:23.167784Z  INFO text_generation_launcher: Terminating webserver
2024-12-10T16:34:23.167801Z  INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2024-12-10T16:34:23.167952Z  INFO text_generation_router::server: router/src/server.rs:2494: signal received, starting graceful shutdown
2024-12-10T16:34:23.868638Z  INFO text_generation_launcher: webserver terminated
2024-12-10T16:34:23.868660Z  INFO text_generation_launcher: Shutting down shards
Error: ShardFailed

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

none

Expected behavior

Card gets detected as A100.

No CUDA Error

@lazariv

lazariv commented Dec 12, 2024

We have a similar warning with H100:

WARN text_generation_launcher: Unkown compute for card nvidia-h100

which does not match

Gpu::H100 => write!(f, "nvidia-h100-80fb-hbm3"),

@marceljahnke

We also have a similar warning with another H100 variant:

WARN text_generation_launcher: Unkown compute for card nvidia-h100-nvl

@lazariv Do you mind adding nvidia-h100-nvl into your PR #2837?

e.g.

...
"nvidia-h100-nvl" => Gpu::H100,
...

@lazariv

lazariv commented Dec 13, 2024

@marceljahnke
Done. Anyone proficient in Rust might suggest and implement a more elegant solution to account for other possible variants.
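
For example, one purely illustrative possibility (assuming the launcher matches on the lowercased card name; the names below are hypothetical, not the actual code or the PR) would be to fall back to a prefix match, so any A100 or H100 naming variant (sxm4, pcie, nvl, ...) resolves to the same family:

// Illustrative sketch only: exact names could still be matched first,
// then unrecognized variants are grouped by prefix.
#[derive(Debug, PartialEq)]
enum Gpu {
    A100,
    H100,
    Unknown,
}

fn gpu_from_card_name(card: &str) -> Gpu {
    if card.starts_with("nvidia-a100") {
        Gpu::A100
    } else if card.starts_with("nvidia-h100") {
        Gpu::H100
    } else {
        Gpu::Unknown
    }
}

fn main() {
    assert_eq!(gpu_from_card_name("nvidia-a100-80gb-pcie"), Gpu::A100);
    assert_eq!(gpu_from_card_name("nvidia-h100-nvl"), Gpu::H100);
}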

@ismael-dm

Same issue here: Unkown compute for card nvidia-a100-80gb-pcie
The PR looks good.
