-
Notifications
You must be signed in to change notification settings - Fork 614
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vz: implement auto save/restore #2900
base: master
Are you sure you want to change the base?
Conversation
Using suspend/resume shortens the startup time of the Docker VM I use from 37 seconds to 13 seconds. |
Do you have an issue describing the feature? Did you look at the PR #642 |
Ah, however, this PR does not implement |
You might want to add that as a separate feature, i.e. describe the auto-suspend |
templates/default.yaml
Outdated
# When "suspend" is enabled, Lima suspends the VM instead of shutting it down. | ||
# The VM can be resumed with `limactl start` if the saved state is available. | ||
# 🟢 Builtin default: false | ||
suspend: null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this yaml property?
Can we just have limactl suspend
command? (Or limactl stop --suspend
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s hard to imagine needing to run limactl suspend
before restarting the host OS just to take advantage of suspend’s benefits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding flags to stop makes it more complicated. I would go with adding suspend/resume commands and keep the ux simple.
You either do:
limactl stop
limactl start
Or:
limactl suspend
limactl resume
This is also virsh interface (using destroy instead of stop).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I delegate the startup and shutdown of my Lima Docker instance to launchd
, so the Docker instance starts and stops with macOS login and logout. In this PR, I aim to change that behavior to use resume/suspend instead.
For this, I want hostagent
to suspend when it receives a SIGINT (sent by limactl stop
or launchd
) if the .suspend
option is set to true
, and to resume on launch if it was previously suspended.
Adding limactl (suspend|resume)
would require a separate mechanism to send/receive something other than SIGINT to handle suspension, increasing the complexity of the changes and not aligning with launchd
shutdown handling. These aren’t changes I’m interested in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this yaml property?
Passing it as a command-line option to limactl start
that would then be forwarded to hostagent
could work as well. However, I’m also considering setting .suspend = true
in override.conf
, so I’d prefer to keep it as an option in limayaml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why you need the extra field in limayaml
.
I think I would just use SIGUSR1 to mean stop
and leave SIGINT to mean suspend
. That way everything works automatically, including your launchd
scenario, and you can still explicitly shut down the VM when you want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further thoughts I think limactl stop
shut do a suspend, and limactl stop --shutdown
(and limactl stop -f
) should shut down and not suspend.
Eventually we could add a limactl reboot
(or restart
) command because I think the only time you really want to shutdown the instance is when you want a fresh boot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why you need the extra field in
limayaml
.
The reason is that I think there hasn't been enough validation of suspending the guest OS with SIGINT in the vz environment. It might be fine to make this behavior the default in the future, but for now, it should be opt-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point, I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having an option in the yaml and in the command line sounds like best way:
- If you always want the feature, you use change the default or override yaml
- If you want the feature only for some vms, you enable it in the instance yaml
- If you want the feature only in some cases, you enable it when stopping
- If works like other features, easier to use, easier to maintain
In addition to |
And to apply the latest |
|
I always edit |
A restart command might be needed sooner than I expected. Currently, I restart using: launchctl kickstart -kp gui/501/io.lima-vm.autostart.docker; tail -F ~/.lima/docker/{ha.stderr,launchd.stderr,serialv}.log However, with |
So how would you deal with this? When you And when you I guess that works, but is maybe too magical? |
Perhaps it would be better to shut down instead of suspending if the |
Exactly. For that, I’m thinking of saving the |
Maybe, but you would lose all your running state/workloads, so shouldn't there be a prompt? That's why Now that I think about it more, I think if you edit |
Would it be better if |
This does not work if the suspend happens in the background, e.g. because you are logging out. Same thing when you automatically resume. In both case you have no chance to prompt the user. If the user edits |
cce12c0
to
8b4dc0f
Compare
Updated PR description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Automatically saving guest state is nice improvement but it should not be the default due to resource usage and the fact that it is not needed or wanted in many cases.
We should not use suspend/resume for this feature to allow adding support for quick suspend/resume that do not stop the instance. This also match better the VZ API:
Documentation is missing. It is better to start with documentation, before implementing. This helps to discuss the big picture, evaluate the user experience, and is much quicker to iterate on. |
8b4dc0f
to
afa2744
Compare
updated commits and PR description to aligning the terminology with libvirt:
|
Why not use the same names as in the qemu implementation? (suspend/resume) We already used "save" (and "load") for saving/loading a snapshot, internally that is. |
Because its behavior differs from the QEMU implementation. |
Right, then it is a good thing to use different names. Actually the commands were called STOP and CONT in qemu. For suspending to disk, the state is also called SUSPEND (standby/S3) or SUSPEND_DISK (hibernate/S4). My bad. |
- Only `aarch64` is supported due to lack of support in `Code-Hex/vz/v3`. - Attempt to restore if the `vz-machine-state` file is found in the instance directory. If this fails, proceed with a cold boot. - On stop, always attempt to save to the `vz-machine-state` file; if this fails, perform a shutdown. - Guest OS and Guest Agent behavior are not considered here. Handling may be needed for saving while port forwarding is active. Signed-off-by: Norio Nomura <[email protected]>
default `false` Signed-off-by: Norio Nomura <[email protected]>
a2c6da7
to
46cb122
Compare
The changes to ensure that the same |
if err := os.Remove(machineStatePath); err != nil && !errors.Is(err, os.ErrNotExist) { | ||
logrus.WithError(err).Errorf("Failed to remove the old VZ machine state file %q", machineStatePath) | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can write machine state to vz-machine-state.tmp
and then replace the current state by renaming vz-machine-state.tmp
to vz-machine-state
. This ensures that we always have consistent state, either the previous one or the new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, I don't see the value in keeping the vz-machine-state
after a resume. This is because the current implementation does not save disk snapshots, and there is no mechanism to revert the guest OS's changes to the diffdisk after a resume to restore the state to match when the vz-machine-state
was saved.
By the time we reach this point, the guest OS should already have gone through StatusRunning
, so I don't think it's worth complicating things to maintain consistency with the previous vz-machine-state
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Maybe add a comment here to explain why we don't keep the state?
if err := os.Remove(machineStatePath); err != nil { | ||
// We should log the error but continue the process, because the machine state is already restored | ||
logrus.WithError(err).Errorf("Failed to remove the VZ machine state file %q", machineStatePath) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we always remove the saved state? what if you want to always start with the same saved state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if you want to always start with the same saved state?
As I mentioned in a reply to another comment earlier, achieving that would require a snapshot feature that includes saving and restoring the diffdisk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we can add this later.
logrus.WithError(err).Warn("Failed to save VZ. Falling back to shutdown") | ||
} else { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code to add saveOnStop must be added before you change Stop(). Each commit should be correct so we can easily bisect issues.
The order of commits should be:
- Add saveOnStop option (it can do nothing at this point)
- Save when stopping, considering the new option
If you don't want to add unused option, add the internal flag defaulting to false in the first commit, and expose the option in the second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code to add saveOnStop must be added before you change Stop().
That should already be the case. Could you double-check to ensure you're not looking at an outdated commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still the case - the first commit ignore the option added in the second commit. If we need to revert the second commit we change the behavior to always save.
Signed-off-by: Norio Nomura <[email protected]>
…s to `ha.sock` - pkg/driver: Add `Driver.RuntimeConfig` - pkg/hostagent/api/client: Add `HostAgentClient.DriverConfig` - pkg/hostagent/api/server: Add `Backend.DriverConfig` - pkg/hostagent: Add `HostAgent.DriverRuntimeConfig` - pkg/vz: Add `LimaVzDriver.RuntimeConfig` - pkg/instance: Add `saveOnStop` parameter to `StopGracefully` Signed-off-by: Norio Nomura <[email protected]>
8c3fb15
to
beab7d4
Compare
I've applied the review suggestions. |
```console $ limactl save --help Save an instance Usage: limactl save INSTANCE [flags] Flags: -h, --help help for save Global Flags: --debug debug mode --log-format string Set the logging format [text, json] (default "text") --log-level string Set the logging level [trace, debug, info, warn, error] --tty Enable TUI interactions such as opening an editor. Defaults to true when stdout is a terminal. Set to false for automation. (default true) ``` Signed-off-by: Norio Nomura <[email protected]>
Signed-off-by: Norio Nomura <[email protected]>
Signed-off-by: Norio Nomura <[email protected]>
beab7d4
to
7f4df6c
Compare
I force-pushed to remove a stray code block that had no effect. |
} | ||
|
||
// Remove the old machine state file if it exists, | ||
// because saving the machine state will fail if the file already exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing to avoid failure on save is not a good reason, save can replace existing file.
logrus.WithError(err).Warn("Failed to get macOS product version") | ||
} else if macOSProductVersion.LessThan(*semver.New("14.0.0")) { | ||
logrus.Warn("saveOnStop is not supported on macOS prior to 14.0") | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will more clear if we extract all the checks to a helper like:
if *y.SaveOnStop {
y.SaveOnStop = validateSaveOnStop(y)
}
In the helper can do something like:
func validateSaveOnStop() *bool {
notSupported := ptr.Of(false)
if condition {
log ...
return notSupported
}
if other condition {
log ...
return notSupported
}
...
return ptr.Of(true)
}
} | ||
|
||
func New(driver *driver.BaseDriver) *LimaVzDriver { | ||
return &LimaVzDriver{ | ||
BaseDriver: driver, | ||
config: LimaVzDriverRuntimeConfig{ | ||
SaveOnStop: *driver.Instance.Config.SaveOnStop, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preparing for more config options?
logrus.WithError(err).Warn("Failed to save VZ. Falling back to shutdown") | ||
} else { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still the case - the first commit ignore the option added in the second commit. If we need to revert the second commit we change the behavior to always save.
@@ -43,6 +43,7 @@ type Instance struct { | |||
// Hostname, not HostName (corresponds to SSH's naming convention) | |||
Hostname string `json:"hostname"` | |||
Status Status `json:"status"` | |||
Saved bool `json:"saved"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can document here what saved mean?
return nil, err | ||
} | ||
body = bytes.NewBuffer(b) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing both GET and PATCH in the same function is confusing, both for understanding this code, and both for code using this function. If we want to avoid duplicate code for sending a request and getting a response, move the common code to a helper.
}{SaveOnStop: saveOnStop} | ||
_, err = haClient.DriverConfig(ctx, disableSaveOnStopConfig) | ||
if err != nil { | ||
return fmt.Errorf("failed to disable saveOnStop: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we treating this as disable when the value can be true?
return fmt.Errorf("failed to disable saveOnStop: %w", err) | ||
} | ||
} else if saveOnStop { | ||
return fmt.Errorf("save is not supported for %q", inst.VMType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing input validation first is better and more clear, something like:
if saveOnStop && inst.VMType != limayaml.VZ {
return fmt.Errorf(...)
}
This keeps the happy path very clear, and it is easy to very that we did not forget to handle all abnormal cases.
return nil, fmt.Errorf("failed to get macOS product version: %w", err) | ||
} else if macOSProductVersion.Major < 14 { | ||
return nil, fmt.Errorf("saveOnStop is not supported on macOS %d", macOSProductVersion.Major) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a copy of the block validating the config in a previous commit. Maybe we can have one place doing this check:
save_restore.Validate(instance) error
Then we can call it in both places,
cmd/limactl/save.go
Outdated
func newSaveCommand() *cobra.Command { | ||
saveCmd := &cobra.Command{ | ||
Use: "save INSTANCE", | ||
Short: "Save an instance", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also stop the instance but the name or the help do not explain this. I think this is confusing user interface even if we improve the help text, and it is better to have:
limactl stop --save
Using --save and --force should be an error.
This also more consistent with saveOnStop instance config. You either enable save on stop for the instance, or for specific stop of the instance.
@@ -523,4 +523,7 @@ func warnExperimental(y *LimaYAML) { | |||
if y.MountInotify != nil && *y.MountInotify { | |||
logrus.Warn("`mountInotify` is experimental") | |||
} | |||
if y.SaveOnStop != nil && *y.SaveOnStop { | |||
logrus.Warn("`saveOnStop` is experimental") | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense but we should add it to the commit adding the option. Do yo plan to squash all the commits to one commit later?
I have other priorities to focus on, so I converted this to a draft for now. |
aarch64
is supported due to lack of support inCode-Hex/vz/v3
.vz-machine-state
file is found in the instance directory. If this fails, proceed with a cold boot.vz-machine-state
file; if this fails, perform a shutdown.I will mark this as Ready for Review once the following are addressed:
.saveOnStop
tolimayaml
.limactl resume
: Not needed.limactl start
: If the saved state is still present, restore.limactl save
: Request. saveOnStop = true
toha.sock
and send SIGINT tohostagent
.limactl stop
: Request. saveOnStop = false
toha.sock
and send SIGINT tohostagent
.hostagent
: Upon receiving SIGINT, if. saveOnStop
is true, save; if false, shutdown.hostagent
: Add an API accepting request to update. saveOnStop
toha.sock
Items to be addressed in a separate PR due to the potential evolution into a snapshot implementation:
limactl save
: Iflima.yaml
is changed since starting VM, promptlimactl stop
, orlimactl reboot
to apply changes to VM.hostagent
: Use the samelima.yaml
andcidata.iso
for both save and restore.lima.yaml
andcidata.iso
for next restoring.lima.yaml
andcidata.iso
kept on starting.Future needs not included in this PR:
limactl reboot
: TBDThis implements part of # 598.I’ll withdraw the reference to the issue since it seems to be focused specifically on QEMU.