FEAT: Supervisor supports restarts #2611
Please fix lint; you can, at the project level, run
Really appreciate your work! To make sure this works, I think we need to add integration tests.
https://github.com/xorbitsai/inference/tree/main/xinference/core/tests
We can put the test here. Maybe we can start a supervisor and a worker, then kill the supervisor and restart it, to see whether the model replica info is synced successfully.
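To make the proposed scenario concrete, here is a minimal sketch of the start, kill, restart, and check sequence. It uses in-memory stand-ins only: `ToySupervisor`, `ToyWorker`, `report_models`, and `run_restart_scenario` are hypothetical names invented for illustration, not xinference classes or APIs, and a real integration test would start actual supervisor and worker processes instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyWorker:
    """Hypothetical stand-in for a worker; it keeps its models while the supervisor is down."""

    address: str
    models: Dict[str, int] = field(default_factory=dict)  # model_uid -> replica count

    def launch(self, model_uid: str, replicas: int = 1) -> None:
        self.models[model_uid] = replicas


class ToySupervisor:
    """Hypothetical stand-in for a supervisor; its state lives in memory and is lost on restart."""

    def __init__(self) -> None:
        # model_uid -> {worker address -> replica count}
        self.replica_info: Dict[str, Dict[str, int]] = {}

    def report_models(self, worker: ToyWorker) -> None:
        # Assumed sync step: the worker reports what it is currently serving.
        for model_uid, replicas in worker.models.items():
            self.replica_info.setdefault(model_uid, {})[worker.address] = replicas

    def list_models(self) -> List[str]:
        return sorted(self.replica_info)


def run_restart_scenario() -> Dict[str, Dict[str, int]]:
    worker = ToyWorker("worker-0")
    supervisor = ToySupervisor()

    worker.launch("model-a", replicas=2)
    supervisor.report_models(worker)
    before = dict(supervisor.replica_info)

    # "Kill" the supervisor and start a fresh one; the worker is left running.
    supervisor = ToySupervisor()
    assert supervisor.list_models() == []  # nothing is known right after the restart

    # The worker re-reports its models, which should restore the replica info.
    supervisor.report_models(worker)
    assert supervisor.replica_info == before
    return supervisor.replica_info
```

The final assertion is the check the real test would need: after the restart, the supervisor's replica info matches what it held before, without the worker being restarted.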
Lint failed, please fix it.
All tests seem to have failed.
d3217f1 to e2f184e
I fixed the tests, and they can run now (only in the GPU CI), but it seems sync_worker did not work. Can you try to make the test pass? We are going to release a new version today; hopefully this PR can make it in.
…support_restart # Conflicts: # .github/workflows/python.yaml
Now CI fails because the model cannot be found after the supervisor is restarted.
Maybe the supervisor is restarted too quickly, without any sleep in between.
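If timing is the cause, one option is to poll for the expected state instead of relying on a fixed sleep. The helper below is only a sketch of that idea: `wait_until` and its parameters are hypothetical and not part of this PR or of xinference, and the predicate passed to it would be whatever check the test uses to confirm that the model is visible again.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Call `predicate` repeatedly until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In a test, the return value can be asserted directly, so a timeout shows up as a clear failure instead of a flaky one.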
Looks like it still cannot pass.
Fixes #2402
Solves the issue where all workers must be restarted after the Supervisor is restarted.
Plan: