Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in preview code. #6124

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

AustinMroz
Copy link

@AustinMroz AustinMroz commented Dec 19, 2024

In the current preview code, when possible, a non-blocking to operation is performed and, immediately after, the output tensor is used to create an image. If this non-blocking operation has not completed, PIL makes a copy of the uninitialized memory to produce an image. Generally, this will either contain zeros, or the result of a previously generated preview (Sometimes from an entirely different execution). This results in both incorrect output, and wasted computation (unless the memory this output was eventually copied to is reallocated and displayed instead of a future preview).

To resolve this, the state of the preview generation is tracked with an event.

  • The PIL image is created with no copy
  • The preview image is not sent to from the server until ready
  • Completion of this event is polled with a reasonably slow frequency
  • A new preview is not created if a previous preview has not completed

On my system (linux, 2.5.0.dev20240805+rocm6.1, 7900 GRE), I'm seeing no difference in the execution time for latent2rgb and minute performance improvement (~5%) for taesd, but note that erroneous previews were more commonly produced when latent2rgb was used.

EDIT: Fix width and height being swapped in frombuffer call. Minor memory optimization by eliminating tensor concatenation.

In the previous preview code, when possible, a non-blocking `to`
operation is performed and, immediately after, the output tensor is used
to create an image. If this non-blocking operation has not completed,
PIL makes a copy of the uninitialized memory to produce an image.
Generally, this will either contain zeros, or the result of a previously
generated preview. This results in both incorrect output, and wasted
computation (unless the memory this output was eventually copied to is
reallocated and displayed instead of a future preview).

To resolve this, the state of the preview generation is tracked with an
event.
- The PIL image is created with no copy
- The preview image is not sent to from the server until ready
- Completion of this event is polled with a reasonably slow frequency
- A new preview is not created if a previous preview has not completed
@comfyanonymous
Copy link
Owner

This breaks things on non cuda platforms like mps.

@AustinMroz
Copy link
Author

Much appreciated. Is device_supports_non_blocking sufficient for determining if a devices support events as so?

@comfyanonymous
Copy link
Owner

There seems to be torch.xpu.event and torch.mps.event specifically for those platforms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants