Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: Supervisor supports restarts #2611

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

paradin
Copy link

@paradin paradin commented Nov 29, 2024

Fixes #2402

Solving the issue where all workers need to be restarted after the Supervisor is restarted.

Plan:

  1. Workers detect the disconnection from the Supervisor;
  2. When workers reconnect to the Supervisor, if there are models running, synchronize the model replica information to the Supervisor;
  3. When the Supervisor receives the model replica information synchronized by the workers, it reconstructs the data related to the model replicas.

@XprobeBot XprobeBot added this to the v1.x milestone Nov 29, 2024
@qinxuye
Copy link
Contributor

qinxuye commented Nov 30, 2024

Please fix lint, your can at the project level, run pre-commit install to do commit check.( ensure pre-commit library installed).

Copy link
Contributor

@qinxuye qinxuye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really appreciate your work! I think in order to ensure this work, we may need to add integrated tests.

https://github.com/xorbitsai/inference/tree/main/xinference/core/tests

We can put the test here, maybe we can start a supervisor and worker, then kill the supervisor, restart it, to see if the model replica info successfully synced.

.pre-commit-config.yaml Show resolved Hide resolved
@qinxuye qinxuye changed the title Feature: Supervisor supports restarts FEAT: Supervisor supports restarts Dec 3, 2024
xinference/core/tests/test_restart_supervisor.py Outdated Show resolved Hide resolved
xinference/core/tests/test_restart_supervisor.py Outdated Show resolved Hide resolved
xinference/core/tests/test_restart_supervisor.py Outdated Show resolved Hide resolved
xinference/core/tests/test_restart_supervisor.py Outdated Show resolved Hide resolved
@qinxuye
Copy link
Contributor

qinxuye commented Dec 4, 2024

Lint failed, please fix it.

@qinxuye
Copy link
Contributor

qinxuye commented Dec 11, 2024

Tests seem all failed.

@qinxuye qinxuye force-pushed the feature_supervisor_support_restart branch from d3217f1 to e2f184e Compare December 12, 2024 18:14
@qinxuye
Copy link
Contributor

qinxuye commented Dec 12, 2024

I fixed the tests, the test can run now(only run in GPU CI), but it seemed the sync_worker did not work, can you try to make the test work?

We are gonna release a new version today, hopefully we can catch up with it.

@qinxuye
Copy link
Contributor

qinxuye commented Dec 13, 2024

Now the CI failed at the model cannot be found after supervisor restarted.

@paradin
Copy link
Author

paradin commented Dec 15, 2024

I fixed the tests, the test can run now(only run in GPU CI), but it seemed the sync_worker did not work, can you try to make the test work?

We are gonna release a new version today, hopefully we can catch up with it.

Maybe the supervisor is restarted too fast without sleep.

@qinxuye
Copy link
Contributor

qinxuye commented Dec 17, 2024

Looks like it cannot pass still.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

集群模式部署,如果重启supervisor,必须重启所有woker吗?
3 participants