Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vz: implement auto save/restore #2900

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

norio-nomura
Copy link
Contributor

@norio-nomura norio-nomura commented Nov 13, 2024

  • Only aarch64 is supported due to lack of support in Code-Hex/vz/v3.
  • Attempt to resume if the vz-machine-state file is found in the instance directory. If this fails, proceed with a cold boot.
  • On stop, always attempt to suspend to the vz-machine-state file; if this fails, perform a shutdown.
  • Guest OS and Guest Agent behavior are not considered here. Handling may be needed for suspending while port forwarding is active.

I will mark this as Ready for Review once the following are addressed:

  • Add .saveOnStop to limayaml.
  • limactl resume: Not needed.
  • limactl start: If the saved state is still present, restore.
  • Add limactl save: Request . saveOnStop = true to ha.sock and send SIGINT to hostagent.
  • limactl stop: Request . saveOnStop = false to ha.sock and send SIGINT to hostagent.
  • hostagent: Upon receiving SIGINT, if . saveOnStop is true, save; if false, shutdown.
  • hostagent: Add an API accepting request to update . saveOnStop to ha.sock

Items to be addressed in a separate PR due to the potential evolution into a snapshot implementation:

  • limactl save: If lima.yaml is changed since starting VM, prompt limactl stop, or limactl reboot to apply changes to VM.
  • hostagent: Use the same lima.yaml and cidata.iso for both save and restore.
    • on starting VM, keep lima.yaml and cidata.iso for next restoring.
    • on restoring VM, use lima.yaml and cidata.iso kept on starting.

Future needs not included in this PR:

  • limactl reboot: TBD

This implements part of # 598. I’ll withdraw the reference to the issue since it seems to be focused specifically on QEMU.

@norio-nomura
Copy link
Contributor Author

Using suspend/resume shortens the startup time of the Docker VM I use from 37 seconds to 13 seconds.

@afbjorklund
Copy link
Member

afbjorklund commented Nov 13, 2024

Do you have an issue describing the feature?

Did you look at the PR #642

@norio-nomura
Copy link
Contributor Author

Did you look at the PR #642

Yes, initially I started implementing this as an extension of #642, but since it didn’t align at all with the behavior of the API provided by Virtualization.framework, I implemented it from scratch.

@norio-nomura
Copy link
Contributor Author

I implemented it from scratch.

Ah, however, this PR does not implement limactl suspend or limactl resume. Instead, an option is added to limayaml that allows start/stop to perform resume/suspend actions.

@afbjorklund
Copy link
Member

You might want to add that as a separate feature, i.e. describe the auto-suspend

@norio-nomura norio-nomura changed the title vz: implement suspend/resume vz: implement auto suspend/resume Nov 13, 2024
# When "suspend" is enabled, Lima suspends the VM instead of shutting it down.
# The VM can be resumed with `limactl start` if the saved state is available.
# 🟢 Builtin default: false
suspend: null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this yaml property?
Can we just have limactl suspend command? (Or limactl stop --suspend)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s hard to imagine needing to run limactl suspend before restarting the host OS just to take advantage of suspend’s benefits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding flags to stop makes it more complicated. I would go with adding suspend/resume commands and keep the ux simple.

You either do:

limactl stop
limactl start

Or:

limactl suspend
limactl resume

This is also virsh interface (using destroy instead of stop).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I delegate the startup and shutdown of my Lima Docker instance to launchd, so the Docker instance starts and stops with macOS login and logout. In this PR, I aim to change that behavior to use resume/suspend instead.

For this, I want hostagent to suspend when it receives a SIGINT (sent by limactl stop or launchd) if the .suspend option is set to true, and to resume on launch if it was previously suspended.
Adding limactl (suspend|resume) would require a separate mechanism to send/receive something other than SIGINT to handle suspension, increasing the complexity of the changes and not aligning with launchd shutdown handling. These aren’t changes I’m interested in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this yaml property?

Passing it as a command-line option to limactl start that would then be forwarded to hostagent could work as well. However, I’m also considering setting .suspend = true in override.conf, so I’d prefer to keep it as an option in limayaml.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you need the extra field in limayaml.

I think I would just use SIGUSR1 to mean stop and leave SIGINT to mean suspend. That way everything works automatically, including your launchd scenario, and you can still explicitly shut down the VM when you want to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thoughts I think limactl stop shut do a suspend, and limactl stop --shutdown (and limactl stop -f) should shut down and not suspend.

Eventually we could add a limactl reboot (or restart) command because I think the only time you really want to shutdown the instance is when you want a fresh boot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you need the extra field in limayaml.

The reason is that I think there hasn't been enough validation of suspending the guest OS with SIGINT in the vz environment. It might be fine to make this behavior the default in the future, but for now, it should be opt-in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point, I agree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an option in the yaml and in the command line sounds like best way:

  • If you always want the feature, you use change the default or override yaml
  • If you want the feature only for some vms, you enable it in the instance yaml
  • If you want the feature only in some cases, you enable it when stopping
  • If works like other features, easier to use, easier to maintain

@norio-nomura
Copy link
Contributor Author

In addition to vz-machine-state, it might be necessary to save the cidata.iso and lima.yaml at the point of suspend for use during resume.

@norio-nomura
Copy link
Contributor Author

In addition to vz-machine-state, it might be necessary to save the cidata.iso and lima.yaml at the point of suspend for use during resume.

And to apply the latest lima.yaml to the guest, a cold boot might be required.

@jandubois
Copy link
Member

to apply the latest lima.yaml to the guest, a cold boot might be required.

limactl edit already requires the machine to be stopped.

@norio-nomura
Copy link
Contributor Author

to apply the latest lima.yaml to the guest, a cold boot might be required.

limactl edit already requires the machine to be stopped.

I always edit lima.yaml in VSCode. 😝

@norio-nomura
Copy link
Contributor Author

Eventually we could add a limactl reboot (or restart) command because I think the only time you really want to shutdown the instance is when you want a fresh boot.

A restart command might be needed sooner than I expected. Currently, I restart using:

launchctl kickstart -kp gui/501/io.lima-vm.autostart.docker; tail -F ~/.lima/docker/{ha.stderr,launchd.stderr,serialv}.log

However, with launchctl, it ends up performing suspend/resume instead, making it impossible to actually restart.

@jandubois
Copy link
Member

I always edit lima.yaml in VSCode. 😝

So how would you deal with this?

When you suspend, but the lima.yaml file has been edited more recently than the cidata.iso, then you will touch the lima.yaml file to make sure it is newer than the saved state.

And when you start, and the lima.yaml has been edited more recently than the vz-machine-state, then you will ask the user if they want to drop the suspended state?

I guess that works, but is maybe too magical?

@norio-nomura
Copy link
Contributor Author

Perhaps it would be better to shut down instead of suspending if the lima.yaml has been modified since startup?

@norio-nomura
Copy link
Contributor Author

And when you start, and the lima.yaml has been edited more recently than the vz-machine-state, then you will ask the user if they want to drop the suspended state?

Exactly. For that, I’m thinking of saving the lima.yaml and cidata.iso from the time of startup. Resume would always use those saved versions, while only a cold boot would use the latest lima.yaml.

@jandubois
Copy link
Member

Perhaps it would be better to shut down instead of suspending if the lima.yaml has been modified since startup?

Maybe, but you would lose all your running state/workloads, so shouldn't there be a prompt?

That's why limactl edit requires you to stop first.

Now that I think about it more, I think if you edit lima.yaml outside of limactl edit, then it is your responsibility to do a limactl reboot to trigger the changes to take effect. Lima itself should continue to suspend/resume without worrying about your changes.

@norio-nomura
Copy link
Contributor Author

Would it be better if limactl suspend detects changes to lima.yaml and prompts for either limactl stop or limactl reboot, instead of having hostagent detect it when receiving SIGINT?

@jandubois
Copy link
Member

Would it be better if limactl suspend detects changes to lima.yaml and prompts for either limactl stop or limactl reboot

This does not work if the suspend happens in the background, e.g. because you are logging out. Same thing when you automatically resume. In both case you have no chance to prompt the user.

If the user edits lima.yaml without stopping the instance, then the user is responsible for eventually applying the changes by requesting a shutdown. We can ask when the instance is stopped or resumed, and stdin/stdout are both connected to a TTY. But otherwise the changes should be ignored.

@norio-nomura norio-nomura force-pushed the vz-suspend-resume branch 3 times, most recently from cce12c0 to 8b4dc0f Compare November 16, 2024 01:38
@norio-nomura
Copy link
Contributor Author

Updated PR description.

Copy link
Member

@nirs nirs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automatically saving guest state is nice improvement but it should not be the default due to resource usage and the fact that it is not needed or wanted in many cases.

We should not use suspend/resume for this feature to allow adding support for quick suspend/resume that do not stop the instance. This also match better the VZ API:

templates/default.yaml Outdated Show resolved Hide resolved
pkg/vz/suspend_resume.go Outdated Show resolved Hide resolved
pkg/vz/suspend_resume.go Outdated Show resolved Hide resolved
pkg/vz/suspend_resume.go Outdated Show resolved Hide resolved
@nirs
Copy link
Member

nirs commented Nov 16, 2024

Documentation is missing. It is better to start with documentation, before implementing. This helps to discuss the big picture, evaluate the user experience, and is much quicker to iterate on.

@norio-nomura
Copy link
Contributor Author

updated commits and PR description to aligning the terminology with libvirt:

  • suspend -> save
  • resume -> restore
  • .suspendOnSigInt -> .saveOnStop

@norio-nomura norio-nomura changed the title vz: implement auto suspend/resume vz: implement auto save/restore Nov 18, 2024
@afbjorklund
Copy link
Member

afbjorklund commented Nov 18, 2024

Why not use the same names as in the qemu implementation? (suspend/resume)

We already used "save" (and "load") for saving/loading a snapshot, internally that is.

@norio-nomura
Copy link
Contributor Author

Why not use the same names as in the qemu implementation?

Because its behavior differs from the QEMU implementation.
In the QEMU implementation, suspend does not save the VM state to disk, and resume does not restore the saved VM state.

@afbjorklund
Copy link
Member

afbjorklund commented Nov 18, 2024

Right, then it is a good thing to use different names. Actually the commands were called STOP and CONT in qemu.

For suspending to disk, the state is also called SUSPEND (standby/S3) or SUSPEND_DISK (hibernate/S4). My bad.

- Only `aarch64` is supported due to lack of support in `Code-Hex/vz/v3`.
- Attempt to restore if the `vz-machine-state` file is found in the instance directory. If this fails, proceed with a cold boot.
- On stop, always attempt to save to the `vz-machine-state` file; if this fails, perform a shutdown.
- Guest OS and Guest Agent behavior are not considered here. Handling may be needed for saving while port forwarding is active.

Signed-off-by: Norio Nomura <[email protected]>
@norio-nomura norio-nomura force-pushed the vz-suspend-resume branch 2 times, most recently from a2c6da7 to 46cb122 Compare November 18, 2024 11:24
@norio-nomura
Copy link
Contributor Author

The changes to ensure that the same lima.yaml and cidata.iso are used for save/restore might potentially evolve into a snapshot implementation. Therefore, I would like to address those in a separate PR.

@norio-nomura norio-nomura marked this pull request as ready for review November 18, 2024 11:44
@norio-nomura norio-nomura requested review from AkihiroSuda, nirs, jandubois and afbjorklund and removed request for nirs November 18, 2024 11:45
if err := os.Remove(machineStatePath); err != nil && !errors.Is(err, os.ErrNotExist) {
logrus.WithError(err).Errorf("Failed to remove the old VZ machine state file %q", machineStatePath)
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can write machine state to vz-machine-state.tmp and then replace the current state by renaming vz-machine-state.tmp to vz-machine-state. This ensures that we always have consistent state, either the previous one or the new one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I don't see the value in keeping the vz-machine-state after a resume. This is because the current implementation does not save disk snapshots, and there is no mechanism to revert the guest OS's changes to the diffdisk after a resume to restore the state to match when the vz-machine-state was saved.

By the time we reach this point, the guest OS should already have gone through StatusRunning, so I don't think it's worth complicating things to maintain consistency with the previous vz-machine-state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Maybe add a comment here to explain why we don't keep the state?

pkg/vz/save_restore_arm64.go Show resolved Hide resolved
pkg/vz/save_restore_arm64.go Outdated Show resolved Hide resolved
pkg/vz/save_restore_arm64.go Outdated Show resolved Hide resolved
pkg/vz/save_restore_arm64.go Outdated Show resolved Hide resolved
pkg/vz/save_restore_arm64.go Outdated Show resolved Hide resolved
if err := os.Remove(machineStatePath); err != nil {
// We should log the error but continue the process, because the machine state is already restored
logrus.WithError(err).Errorf("Failed to remove the VZ machine state file %q", machineStatePath)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we always remove the saved state? what if you want to always start with the same saved state?

Copy link
Contributor Author

@norio-nomura norio-nomura Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if you want to always start with the same saved state?

As I mentioned in a reply to another comment earlier, achieving that would require a snapshot feature that includes saving and restoring the diffdisk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we can add this later.

pkg/vz/vz_driver_darwin.go Outdated Show resolved Hide resolved
logrus.WithError(err).Warn("Failed to save VZ. Falling back to shutdown")
} else {
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code to add saveOnStop must be added before you change Stop(). Each commit should be correct so we can easily bisect issues.

The order of commits should be:

  • Add saveOnStop option (it can do nothing at this point)
  • Save when stopping, considering the new option

If you don't want to add unused option, add the internal flag defaulting to false in the first commit, and expose the option in the second.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code to add saveOnStop must be added before you change Stop().

That should already be the case. Could you double-check to ensure you're not looking at an outdated commit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still the case - the first commit ignore the option added in the second commit. If we need to revert the second commit we change the behavior to always save.

cmd/limactl/copy.go Outdated Show resolved Hide resolved
Signed-off-by: Norio Nomura <[email protected]>
…s to `ha.sock`

- pkg/driver: Add `Driver.RuntimeConfig`
- pkg/hostagent/api/client: Add `HostAgentClient.DriverConfig`
- pkg/hostagent/api/server: Add `Backend.DriverConfig`
- pkg/hostagent: Add `HostAgent.DriverRuntimeConfig`
- pkg/vz: Add `LimaVzDriver.RuntimeConfig`
- pkg/instance: Add `saveOnStop` parameter to `StopGracefully`

Signed-off-by: Norio Nomura <[email protected]>
@norio-nomura
Copy link
Contributor Author

I've applied the review suggestions.
I made changes to past commits and force-pushed, so please fetch the latest and review again.

@norio-nomura norio-nomura requested a review from nirs November 19, 2024 01:35
```console
$ limactl save --help
Save an instance

Usage:
  limactl save INSTANCE [flags]

Flags:
  -h, --help   help for save

Global Flags:
      --debug               debug mode
      --log-format string   Set the logging format [text, json] (default "text")
      --log-level string    Set the logging level [trace, debug, info, warn, error]
      --tty                 Enable TUI interactions such as opening an editor. Defaults to true when stdout is a terminal. Set to false for automation. (default true)
```

Signed-off-by: Norio Nomura <[email protected]>
@norio-nomura
Copy link
Contributor Author

I force-pushed to remove a stray code block that had no effect.

}

// Remove the old machine state file if it exists,
// because saving the machine state will fail if the file already exists.
Copy link
Member

@nirs nirs Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing to avoid failure on save is not a good reason, save can replace existing file.

logrus.WithError(err).Warn("Failed to get macOS product version")
} else if macOSProductVersion.LessThan(*semver.New("14.0.0")) {
logrus.Warn("saveOnStop is not supported on macOS prior to 14.0")
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will more clear if we extract all the checks to a helper like:

if *y.SaveOnStop {
    y.SaveOnStop = validateSaveOnStop(y)
}

In the helper can do something like:

func validateSaveOnStop() *bool {
    notSupported := ptr.Of(false)
    if condition {
        log ...
        return notSupported
    }
    if other condition {
        log ...
        return notSupported
    }
    ...
    return ptr.Of(true)
}

}

func New(driver *driver.BaseDriver) *LimaVzDriver {
return &LimaVzDriver{
BaseDriver: driver,
config: LimaVzDriverRuntimeConfig{
SaveOnStop: *driver.Instance.Config.SaveOnStop,
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preparing for more config options?

logrus.WithError(err).Warn("Failed to save VZ. Falling back to shutdown")
} else {
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still the case - the first commit ignore the option added in the second commit. If we need to revert the second commit we change the behavior to always save.

@@ -43,6 +43,7 @@ type Instance struct {
// Hostname, not HostName (corresponds to SSH's naming convention)
Hostname string `json:"hostname"`
Status Status `json:"status"`
Saved bool `json:"saved"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can document here what saved mean?

return nil, err
}
body = bytes.NewBuffer(b)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing both GET and PATCH in the same function is confusing, both for understanding this code, and both for code using this function. If we want to avoid duplicate code for sending a request and getting a response, move the common code to a helper.

}{SaveOnStop: saveOnStop}
_, err = haClient.DriverConfig(ctx, disableSaveOnStopConfig)
if err != nil {
return fmt.Errorf("failed to disable saveOnStop: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we treating this as disable when the value can be true?

return fmt.Errorf("failed to disable saveOnStop: %w", err)
}
} else if saveOnStop {
return fmt.Errorf("save is not supported for %q", inst.VMType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing input validation first is better and more clear, something like:

if saveOnStop && inst.VMType != limayaml.VZ {
    return fmt.Errorf(...)
}

This keeps the happy path very clear, and it is easy to very that we did not forget to handle all abnormal cases.

return nil, fmt.Errorf("failed to get macOS product version: %w", err)
} else if macOSProductVersion.Major < 14 {
return nil, fmt.Errorf("saveOnStop is not supported on macOS %d", macOSProductVersion.Major)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a copy of the block validating the config in a previous commit. Maybe we can have one place doing this check:

save_restore.Validate(instance) error

Then we can call it in both places,

func newSaveCommand() *cobra.Command {
saveCmd := &cobra.Command{
Use: "save INSTANCE",
Short: "Save an instance",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also stop the instance but the name or the help do not explain this. I think this is confusing user interface even if we improve the help text, and it is better to have:

limactl stop --save

Using --save and --force should be an error.

This also more consistent with saveOnStop instance config. You either enable save on stop for the instance, or for specific stop of the instance.

@@ -523,4 +523,7 @@ func warnExperimental(y *LimaYAML) {
if y.MountInotify != nil && *y.MountInotify {
logrus.Warn("`mountInotify` is experimental")
}
if y.SaveOnStop != nil && *y.SaveOnStop {
logrus.Warn("`saveOnStop` is experimental")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense but we should add it to the commit adding the option. Do yo plan to squash all the commits to one commit later?

@norio-nomura norio-nomura marked this pull request as draft November 28, 2024 00:51
@norio-nomura
Copy link
Contributor Author

I have other priorities to focus on, so I converted this to a draft for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants