Replies: 9 comments 20 replies
-
@sayakpaul did you try to run inference just through the UNet (i.e., skip the VAE, in case it's using that much memory)?
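For reference, a minimal sketch of running only the UNet forward pass to see its memory footprint in isolation (SD v1.5 shapes assumed for brevity; the model id and dummy inputs are illustrative, and SDXL's UNet would additionally need `added_cond_kwargs`):

```python
# Minimal sketch: measure the UNet's memory footprint on its own, skipping the VAE.
# SD v1.5 shapes are assumed (512x512 -> 64x64 latents, 77x768 CLIP text states).
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
).to("cuda")

latents = torch.randn(1, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor([10], device="cuda")
text_states = torch.randn(1, 77, 768, dtype=torch.float16, device="cuda")

with torch.no_grad():
    _ = unet(latents, timestep, encoder_hidden_states=text_states).sample

print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```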
-
@sayakpaul a couple of comments:
-
Cc: @younesbelkada for feedback as well (as he is our in-house ninja for working with reduced precision).
-
SD with a batch size of 4
-
SDXL with a batch size of 1 (steps: 30)
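For context, a rough harness for this kind of measurement under the stated settings (SDXL, batch size 1, 30 steps); the prompt is a placeholder:

```python
# Rough benchmark sketch: latency and peak VRAM for SDXL, batch size 1, 30 steps.
import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"  # placeholder prompt
torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
_ = pipe(prompt, num_inference_steps=30, num_images_per_prompt=1).images
torch.cuda.synchronize()

print(f"latency: {time.perf_counter() - start:.2f}s")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```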
-
@dacorvo plotted the distribution of the weights of the UNet as well:

```python
from diffusers import UNet2DConditionModel
import matplotlib.pyplot as plt

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
).eval()

# Collect the flattened weight tensor of every parametrized layer.
weights = []
for name, param in unet.named_parameters():
    if "weight" in name:
        weights.append(param.view(-1).cpu().detach().numpy())

# Overlay one histogram per layer.
plt.figure(figsize=(10, 6))
for i, weight in enumerate(weights):
    plt.hist(weight, bins=50, alpha=0.5, label=f"Layer {i+1}")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.title("Distribution of Weights in the Neural Network")
plt.savefig("sdxl_unet_weight_dist.png", bbox_inches="tight", dpi=300)
```

(Plots: weight distributions for the SDXL and SD v1.5 UNets.)

Weights seem to be concentrated around 0. Does this quite fit the bill for FP8 quantization?
-
I rebased the branch and did a refactoring.
-
It should be OK now.
-
@dacorvo I am getting: `RuntimeError: Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and Half`
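A minimal sketch of what appears to trigger this: a binary op mixing a float8 tensor with a half tensor forces dtype promotion, which PyTorch doesn't support for float8 (the tensor names here are purely illustrative):

```python
import torch

a = torch.randn(4, 4, dtype=torch.float16, device="cuda")
b = a.to(torch.float8_e4m3fn)  # same values, stored as float8_e4m3fn

# a * b  # raises: RuntimeError: Promotion for Float8 Types is not supported ...

# Workaround: explicitly upcast the float8 operand before mixing dtypes.
out = a * b.to(torch.float16)
```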
-
Comfy and A1111 have been supporting Float8 for some time now:
A1111 reports quite nice improvements in VRAM consumption:
Timing takes a hit because of the casting overhead, but that's okay in the interest of the reduced VRAM, IMO.
So, I tried using `quanto` to potentially benefit from FP8 (benchmark run on a 4090). Here are the stats and resultant images (batch size of 1):
As we can see, we're able to obtain a good amount of VRAM reduction here in comparison to FP16. Do we want to achieve that in `diffusers` natively, or is supporting this via `quanto` preferable? I am okay with the latter.
Edit: int8 is even better: #7023 (comment).
See also: huggingface/optimum-quanto#74. Cc: @dacorvo.
Curious to know your thoughts here: @yiyixuxu @DN6.
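For completeness, a sketch of what FP8 weight quantization of the SDXL UNet with `quanto` could look like (API names taken from optimum-quanto; treat the exact calls as an assumption and check the library docs):

```python
# Sketch: quantize the SDXL UNet weights to float8 with quanto, keep the rest in fp16.
import torch
from diffusers import StableDiffusionXLPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

quantize(pipe.unet, weights=qfloat8)  # swap linear/conv weights for float8 versions
freeze(pipe.unet)                     # drop the fp16 master weights to actually save VRAM

image = pipe("an astronaut riding a horse on mars", num_inference_steps=30).images[0]
image.save("sdxl_fp8.png")
```

Weights stay in float8 and are cast back to fp16 at compute time, which matches the VRAM-vs-latency trade-off described above.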