System Info / 系統信息

cuda 12.2

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息

1.1.0

The command used to start Xinference / 用以启动 xinference 的命令

xinference launch --model-name qwen2.5-instruct \
  --model-type LLM \
  --model-uid Qwen1_5B \
  --model_path /models/Qwen/Qwen2___5-1___5B-Instruct \
  --model-engine 'vllm' \
  --model-format 'pytorch' \
  --quantization None \
  --n-gpu 1 \
  --gpu-idx "0" \
  --tensor_parallel_size 1 \
  --gpu_memory_utilization 0.30 \
  --max_model_len 4096

Reproduction / 复现过程

Running the command above produces:

Launch model name: qwen2.5-instruct with kwargs: {'model_path': '/models/Qwen/Qwen2___5-1___5B-Instruct', 'tensor_parallel_size': 1, 'gpu_memory_utilization': 0.3, 'max_model_len': 4096}
Traceback (most recent call last):
  File "/usr/local/bin/xinference", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/cmdline.py", line 908, in model_launch
    model_uid = client.launch_model(
  File "/usr/local/lib/python3.10/dist-packages/xinference/client/restful/restful_client.py", line 999, in launch_model
    raise RuntimeError(
RuntimeError: Failed to launch model, detail: [address=0.0.0.0:26194, pid=237] User specified GPU index 0 has been occupied with a vLLM model: Qwen0_5B-0, therefore cannot allocate GPU memory for a new model.

Expected behavior / 期待表现

Within the limits of available GPU memory, a single GPU should be able to load multiple models.
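For context, the error message implies an earlier launch (a model with UID Qwen0_5B) had already claimed GPU 0. A minimal two-step repro sketch follows; the first command is not shown in the report, so its flags (model name, engine, memory fraction) are assumptions mirroring the second command:

```shell
# Assumed first launch (not in the report) that binds a vLLM model to GPU 0:
xinference launch --model-name qwen2.5-instruct \
  --model-uid Qwen0_5B \
  --model-engine vllm \
  --gpu-idx "0" \
  --gpu_memory_utilization 0.30

# The second launch from the report then fails: Xinference treats a GPU that
# already hosts a vLLM model as occupied, even though gpu_memory_utilization
# 0.30 leaves most of the card's memory free.
xinference launch --model-name qwen2.5-instruct \
  --model-uid Qwen1_5B \
  --model-engine vllm \
  --gpu-idx "0" \
  --gpu_memory_utilization 0.30
```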