Why is padding="max_length" needed for FLUX and SD3? #10177
ilya-lavrenov started this conversation in General
From this code block in `diffusers/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py` (lines 235 to 242 at 22d3a82), we can see that the T5 tokenizer's output always has `max_sequence_length` tokens: the actual output is padded to that size. Why is this required? With only the actual number of tokenized tokens, you could save T5 encoder inference time, and then Transformer model inference time as well.
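For context, here is a standalone sketch of the padding behaviour in question, contrasting the fixed-size call with a variable-length alternative. The checkpoint name and the dynamic-padding variant are assumptions for illustration, not the exact pipeline code:

```python
from transformers import T5TokenizerFast

# Assumed checkpoint purely for illustration; SD3/FLUX ship their own T5 tokenizer.
tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")

prompt = "a photo of an astronaut riding a horse on mars"
max_sequence_length = 256  # SD3's default

# What the pipeline does: always pad (and truncate) to max_sequence_length.
padded = tokenizer(
    prompt,
    padding="max_length",
    max_length=max_sequence_length,
    truncation=True,
    return_tensors="pt",
)
print(padded.input_ids.shape)  # torch.Size([1, 256]) regardless of prompt length

# The alternative suggested here: keep only the tokens the prompt actually needs.
dynamic = tokenizer(
    prompt,
    padding=True,  # pad only to the longest prompt in the batch
    max_length=max_sequence_length,
    truncation=True,
    return_tensors="pt",
)
print(dynamic.input_ids.shape)  # e.g. torch.Size([1, 12]) for a short prompt
```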
According to our experiments, we have the following breakdown, where we can see that using the actual number of tokens is 1.6x faster, while the output image is slightly different but still corresponds to the text prompt:
[Comparison at sequence lengths 64, 128, 256, and 512; images not reproduced here]
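For reference, a minimal sketch of how the padded vs. actual-length T5 encoder cost could be compared. The checkpoint name, device, and timing approach are assumptions, not the setup behind the numbers above:

```python
import time

import torch
from transformers import T5EncoderModel, T5TokenizerFast

# Assumed checkpoint/device for illustration only.
model_id = "google/t5-v1_1-xxl"
device = "cuda"

tokenizer = T5TokenizerFast.from_pretrained(model_id)
text_encoder = T5EncoderModel.from_pretrained(model_id, torch_dtype=torch.float16).to(device)

prompt = "a photo of an astronaut riding a horse on mars"

def encode(padding, max_length=256):
    # Tokenize with either fixed-size or dynamic padding, then time the encoder pass.
    inputs = tokenizer(
        prompt,
        padding=padding,
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    ).to(device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        embeds = text_encoder(inputs.input_ids).last_hidden_state
    torch.cuda.synchronize()
    return embeds.shape, time.perf_counter() - start

print(encode("max_length"))  # fixed-length sequence of 256 tokens
print(encode(True))          # only the real prompt tokens
```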