Cluster deployment: if the supervisor is restarted, do all workers have to be restarted? #2402

Closed
1 of 3 tasks
paradin opened this issue Oct 8, 2024 · 7 comments · May be fixed by #2611

paradin commented Oct 8, 2024

System Info

docker

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

0.15.2

The command used to start Xinference

Distributed deployment; the supervisor and the worker are started normally.
The supervisor is started with supervisor-port specified.
A model, e.g. bge-m3, is launched on the worker.

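Reproduction

Restart the supervisor. The frontend no longer shows the running model bge-m3, and the model service is unavailable.

Expected behavior

  1. After the supervisor restarts, models that were already running keep running normally.
  2. The model service remains available.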
XprobeBot added this to the v0.15 milestone Oct 8, 2024

pkunight commented

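I ran into this problem too. The supervisor has to be started first and the workers afterwards, and the connection must not be interrupted after that. Otherwise, even if the supervisor restarts successfully, the workers keep reporting errors saying they cannot connect to the supervisor's IP address.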


paradin commented Oct 10, 2024

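> I ran into this problem too. The supervisor has to be started first and the workers afterwards, and the connection must not be interrupted after that. Otherwise, even if the supervisor restarts successfully, the workers keep reporting errors saying they cannot connect to the supervisor's IP address.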

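If supervisor-port is specified when starting the supervisor, the workers can reconnect after the supervisor restarts (because the supervisor's port is fixed).
The problem is that the supervisor is currently stateful: after a restart the workers can still report_status, but they do not report the status of their running models.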
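
To make the point concrete, here is a minimal sketch of a heartbeat that also carries the list of running models, so a freshly restarted supervisor can rebuild its view from what the workers report. This is not Xinference's implementation: the `Worker` and `Supervisor` classes, the `heartbeat` and `on_heartbeat` methods, and the address `worker-1:30001` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that knows which models it is running."""
    address: str
    running_models: set = field(default_factory=set)

    def heartbeat(self) -> dict:
        # Assumption: the heartbeat carries the model list, not only node status.
        return {"address": self.address, "models": sorted(self.running_models)}


class Supervisor:
    """Hypothetical supervisor whose state lives only in memory."""

    def __init__(self) -> None:
        self.models_by_worker = {}

    def on_heartbeat(self, payload: dict) -> None:
        # Overwrite the entry so the view is rebuilt from what the worker reports.
        self.models_by_worker[payload["address"]] = payload["models"]

    def list_models(self) -> list:
        return sorted(
            model
            for models in self.models_by_worker.values()
            for model in models
        )


worker = Worker("worker-1:30001", {"bge-m3"})
supervisor = Supervisor()
supervisor.on_heartbeat(worker.heartbeat())

supervisor = Supervisor()                    # simulate a restart: in-memory state is lost
after_restart = supervisor.list_models()     # nothing known yet
supervisor.on_heartbeat(worker.heartbeat())  # next heartbeat restores the entry
after_heartbeat = supervisor.list_models()
```

In this toy model the supervisor knows nothing right after the restart and sees bge-m3 again after the next heartbeat, which is the behavior asked for under "Expected behavior".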

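If the supervisor could be made stateless (for example, by sharing state through Redis), that would also solve the supervisor's current single point of failure.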
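
A rough sketch of that idea, assuming the model registry is kept in an external store rather than in the supervisor process. `InMemoryStore` is only a stand-in for something like Redis, and `StatelessSupervisor`, `register_model`, and the key name `running_models` are hypothetical; nothing here reflects how Xinference is actually built.

```python
class InMemoryStore:
    """Stand-in for an external store such as Redis (hypothetical interface)."""

    def __init__(self) -> None:
        self._data = {}

    def add(self, key: str, member: str) -> None:
        self._data.setdefault(key, set()).add(member)

    def members(self, key: str) -> set:
        return set(self._data.get(key, set()))


class StatelessSupervisor:
    """Hypothetical supervisor that keeps no registry of its own."""

    KEY = "running_models"

    def __init__(self, store: InMemoryStore) -> None:
        self.store = store

    def register_model(self, worker: str, model: str) -> None:
        # All state goes to the shared store, never to instance attributes.
        self.store.add(self.KEY, f"{worker}/{model}")

    def list_models(self) -> list:
        return sorted(self.store.members(self.KEY))


store = InMemoryStore()
first = StatelessSupervisor(store)
first.register_model("worker-1", "bge-m3")

# A restarted supervisor, or a second replica, attached to the same store.
second = StatelessSupervisor(store)
listed_by_second = second.list_models()
```

Because every instance reads from the same store, a replacement supervisor sees the models registered by its predecessor, and several supervisors could run side by side.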


ak47947 commented Oct 12, 2024

I've run into this problem as well. Unless it is fixed, cluster mode can't really be used.

This issue is stale because it has been open for 7 days with no activity.

github-actions bot added the stale label Oct 19, 2024

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions bot closed this as not planned Oct 24, 2024
qinxuye removed the stale label Dec 12, 2024
qinxuye reopened this Dec 12, 2024

This issue is stale because it has been open for 7 days with no activity.

github-actions bot added the stale label Dec 19, 2024

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions bot closed this as not planned Dec 25, 2024