Replies: 3 comments 2 replies
-
How much slower? I think this is kind of expected, but cc'ing @sayakpaul for more insights.
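One plausible reason the slowdown is "kind of expected" (an assumption on my part, not something the maintainers confirm here): fp8 in quanto is typically weight-only quantization, so each linear layer dequantizes its weights back to a higher precision before the matmul unless fused fp8 kernels are available, which adds work on every forward pass. A toy numpy sketch of that extra step (illustrative only, not quanto's actual kernels):

```python
import numpy as np

def fake_low_precision_quantize(w, n_bits=3):
    # Simulate a coarse quantization by rounding to a small grid;
    # a stand-in for a real fp8 (e4m3) cast, which numpy lacks.
    scale = float(np.max(np.abs(w))) or 1.0
    q = np.round(w / scale * (2 ** n_bits)).astype(np.int8)
    return q, scale

def dequantize(q, scale, n_bits=3):
    # This extra conversion is the per-forward overhead of a
    # weight-only scheme without fused low-precision matmuls.
    return q.astype(np.float32) * scale / (2 ** n_bits)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.standard_normal((8, 8)).astype(np.float32)

# bf16-style path: a single matmul.
y_ref = x @ w

# Weight-only quantized path: dequantize first, then matmul.
q, scale = fake_low_precision_quantize(w)
y_q = x @ dequantize(q, scale)

err = float(np.abs(y_ref - y_q).max())
print("max abs error from quantization:", err)
```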
-
Are you not moving your model to the GPU? I don't see any device placements.
-
Also, I'm going to move this to Discussions, as this is not a library issue.
-
Describe the bug
I use the optimum.quanto package to call the quantization function. When the model is quantized to fp8, inference is much slower than with bf16. I'd like to know why. Thank you.
Reproduction
Logs
No response
System Info
x86, torch 2.4 + CUDA 12.2
Who can help?
No response