TypeError: '>=' not supported between instances of 'NoneType' and 'int' #2828

Open

KartDriver opened this issue Dec 11, 2024 · 1 comment

@KartDriver

System Info

(text-generation-inference) scin@krakatoa:~$ text-generation-launcher --env
2024-12-11T18:48:31.147398Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: 82c24f7
Docker label: N/A
nvidia-smi:
Wed Dec 11 18:48:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |
| 0% 42C P8 13W / 275W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:02:00.0 Off | N/A |
| 0% 44C P8 25W / 275W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:03:00.0 Off | N/A |
| 0% 42C P8 24W / 275W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:04:00.0 Off | N/A |
| 35% 31C P8 31W / 275W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 3090 On | 00000000:05:00.0 Off | N/A |
| 0% 42C P8 42W / 275W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
xpu-smi:
N/A
2024-12-11T18:48:31.147473Z INFO text_generation_launcher: Args {
model_id: "bigscience/bloom-560m",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "0.0.0.0",
port: 3000,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: true,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
config.json [00:00:00] 693 B/693 B 4.39 KiB/s (0s)
2024-12-11T18:48:34.581849Z INFO text_generation_launcher: Forcing attention to 'flashdecoding' because head dim is not supported by flashinfer, also disabling prefix caching
2024-12-11T18:48:34.581893Z INFO text_generation_launcher: Using attention flashdecoding - Prefix caching 0
2024-12-11T18:48:34.646542Z WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-3090
2024-12-11T18:48:34.712901Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2024-12-11T18:48:34.712926Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-12-11T18:48:34.713142Z INFO download: text_generation_launcher: Starting check and download process for bigscience/bloom-560m
2024-12-11T18:48:38.229198Z INFO text_generation_launcher: Download file: model.safetensors

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

(text-generation-inference) scin@krakatoa:~$ CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher \
    --model-id /mnt/models/meta-llama/Llama-3.3-70B-Instruct/ \
    --sharded true \
    --num-shard 4 \
    --quantize bitsandbytes \
    --max-input-tokens 2048 \
    --max-total-tokens 4096 \
    --max-batch-size 8
2024-12-11T18:25:32.807525Z INFO text_generation_launcher: Args {
model_id: "/mnt/models/meta-llama/Llama-3.3-70B-Instruct/",
revision: None,
validation_workers: 2,
sharded: Some(
true,
),
num_shard: Some(
4,
),
quantize: Some(
Bitsandbytes,
),
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(
2048,
),
max_input_length: None,
max_total_tokens: Some(
4096,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: Some(
8,
),
cuda_graphs: None,
hostname: "0.0.0.0",
port: 3000,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2024-12-11T18:25:36.061592Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-12-11T18:25:36.061634Z INFO text_generation_launcher: Sharding model on 4 processes
2024-12-11T18:25:36.254821Z WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-3090
2024-12-11T18:25:36.337546Z WARN text_generation_launcher: Not enough VRAM to run the model: Available: 97.93GB - Model 127.68GB.
2024-12-11T18:25:36.337590Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2024-12-11T18:25:36.337607Z WARN text_generation_launcher: Bitsandbytes is deprecated, use eetq instead, which provides better latencies overall and is drop-in in most cases.
2024-12-11T18:25:36.337622Z WARN text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-12-11T18:25:36.337867Z INFO download: text_generation_launcher: Starting check and download process for /mnt/models/meta-llama/Llama-3.3-70B-Instruct/
2024-12-11T18:25:40.220528Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-12-11T18:25:40.971370Z INFO download: text_generation_launcher: Successfully downloaded weights for /mnt/models/meta-llama/Llama-3.3-70B-Instruct/
2024-12-11T18:25:40.971794Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-11T18:25:40.979679Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-12-11T18:25:40.979746Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-12-11T18:25:40.979746Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-12-11T18:25:44.579432Z INFO text_generation_launcher: Using prefix caching = True
2024-12-11T18:25:44.579523Z INFO text_generation_launcher: Using Attention = flashinfer
2024-12-11T18:25:45.106837Z WARN text_generation_launcher: exllamav2_kernels not installed.
2024-12-11T18:25:45.116706Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-12-11T18:25:51.045493Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-12-11T18:25:51.075031Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-12-11T18:25:51.079025Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-12-11T18:25:51.085019Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-12-11T18:26:01.091932Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-12-11T18:26:01.128324Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-12-11T18:26:01.132268Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-12-11T18:26:01.146315Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-12-11T18:26:11.138573Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-12-11T18:26:11.169899Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-12-11T18:26:11.173043Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-12-11T18:26:11.213768Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-12-11T18:26:21.162127Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-12-11T18:26:21.219010Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-12-11T18:26:21.268627Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-12-11T18:26:21.306493Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-12-11T18:26:31.257146Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-12-11T18:26:31.300640Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-12-11T18:26:31.313775Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-12-11T18:26:31.397810Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-12-11T18:26:33.308202Z INFO text_generation_launcher: Using prefill chunking = True
2024-12-11T18:26:36.179734Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-12-11T18:26:36.207682Z INFO shard-manager: text_generation_launcher: Shard ready in 55.201097835s rank=1
2024-12-11T18:26:36.818240Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2024-12-11T18:26:36.818290Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2024-12-11T18:26:36.818590Z INFO shard-manager: text_generation_launcher: Shard ready in 55.801796419s rank=3
2024-12-11T18:26:36.824017Z INFO shard-manager: text_generation_launcher: Shard ready in 55.815034637s rank=2
2024-12-11T18:26:37.306767Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-11T18:26:37.384325Z INFO shard-manager: text_generation_launcher: Shard ready in 56.386862739s rank=0
2024-12-11T18:26:37.426239Z INFO text_generation_launcher: Starting Webserver
2024-12-11T18:26:37.518635Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-11T18:26:38.029590Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-11T18:26:38.572695Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/home/scin/miniconda3/envs/text-generation-inference/bin/text-generation-server", line 8, in
sys.exit(app())
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/typer/main.py", line 321, in call
return get_command(self)(*args, **kwargs)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/typer/core.py", line 728, in main
return _main(
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/typer/core.py", line 197, in _main
rv = self.invoke(ctx)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
return callback(**use_params)
File "/home/scin/text-generation-inference/server/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/home/scin/text-generation-inference/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/home/scin/text-generation-inference/server/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/home/scin/text-generation-inference/server/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/home/scin/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 1577, in warmup
_, _batch, _ = self.generate_token(batch)
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/home/scin/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 1957, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/home/scin/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 1839, in forward
with self._forward_context(
File "/home/scin/miniconda3/envs/text-generation-inference/lib/python3.11/contextlib.py", line 137, in enter
return next(self.gen)
File "/home/scin/text-generation-inference/server/text_generation_server/layers/attention/flashinfer.py", line 86, in use_prefill_with_paged_kv_state
state.begin_forward(
File "/home/scin/flashinfer/flashinfer/prefill.py", line 1078, in plan
window_left >= 0, # use_sliding_window
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
2024-12-11T18:26:38.573225Z ERROR warmup{max_input_length=Some(2048) max_prefill_tokens=4096 max_total_tokens=Some(4096) max_batch_size=Some(8)}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: '>=' not supported between instances of 'NoneType' and 'int'
[The same "Method Warmup encountered an error" traceback and router warmup error are repeated for the remaining three shards.]
Error: Backend(Warmup(Generation("'>=' not supported between instances of 'NoneType' and 'int'")))
2024-12-11T18:26:38.606855Z ERROR text_generation_launcher: Webserver Crashed
2024-12-11T18:26:38.606883Z INFO text_generation_launcher: Shutting down shards
2024-12-11T18:26:38.611513Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-12-11T18:26:38.611567Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-12-11T18:26:38.622656Z INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2024-12-11T18:26:38.622711Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2024-12-11T18:26:38.627376Z INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2024-12-11T18:26:38.627442Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2024-12-11T18:26:38.687423Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-12-11T18:26:38.687508Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-12-11T18:26:40.426587Z INFO shard-manager: text_generation_launcher: shard terminated rank=3
2024-12-11T18:26:40.793289Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
2024-12-11T18:26:40.816993Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
2024-12-11T18:26:40.832001Z INFO shard-manager: text_generation_launcher: shard terminated rank=2
Error: WebserverFailed
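
For context, the crash happens in the warmup path: TGI's flashinfer integration calls plan()/begin_forward() with window_left=None because Llama 3.3 has no sliding-window attention, and the comparison window_left >= 0 inside the installed flashinfer build then fails. The sketch below only illustrates that mechanism; plan_stub, normalized_window_left, and the "-1 means no sliding window" convention are assumptions for illustration, not the actual flashinfer or TGI code.

from typing import Optional

# Hypothetical stand-in for the failing check inside flashinfer's plan()
# ("window_left >= 0  # use_sliding_window" in prefill.py).
def plan_stub(window_left: Optional[int]) -> bool:
    # With window_left=None this raises exactly the reported error:
    # TypeError: '>=' not supported between instances of 'NoneType' and 'int'
    return window_left >= 0

# Illustrative caller-side guard: map "no sliding window" to an integer
# sentinel before the comparison (assuming -1 means "disabled").
def normalized_window_left(sliding_window: Optional[int]) -> int:
    return -1 if sliding_window is None else sliding_window

if __name__ == "__main__":
    print(plan_stub(normalized_window_left(None)))  # False: sliding window disabled
    try:
        plan_stub(None)  # reproduces the crash seen in the warmup traceback
    except TypeError as exc:
        print(f"TypeError: {exc}")

Since the traceback points at a locally built flashinfer checkout (/home/scin/flashinfer), a version mismatch between that build and what this TGI commit expects is a plausible cause and may be worth ruling out first.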

Expected behavior

Successfully start the webserver.

@Snehallaldas

Start the server with a simpler model to ensure the environment is set up correctly.
If this works, the issue may be specific to bloom-560m.
