How to diagnose problems in training custom inpaint model #10002

Marquess98 · 2024-11-22T03:16:50Z

Discussed in #9989

Originally posted by Marquess98 November 22, 2024
I want to perform image inpainting where the input is a set of multimodal images, using SDXL as the pretrained model. The results are currently very poor, and I cannot tell whether the problem lies in the code, the dataset, the pretrained model, or the training parameters.
The inference code snippet is as follows:

noise_scheduler = DDIMScheduler.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="scheduler")
noise_scheduler.set_timesteps(denoise_steps, device=device)

zi = vae.encode(masked_image).latent_dist.sample()
# zi = vae.encode(masked_image).latent_dist.sample()
zi = zi * vae.config.scaling_factor

zd = vae.encode(img2).latent_dist.sample()
zd = zd * vae.config.scaling_factor

zi_m = vae.encode(masked_image).latent_dist.sample()
zi_m = zi_m * vae.config.scaling_factor

noise = torch.randn_like(zi)
denoise_steps = torch.tensor(denoise_steps,dtype=torch.int32,device=device)
timesteps_add, _  = get_timesteps(noise_scheduler, denoise_steps, 1.0, device, denoising_start=None)
start_step = 5

zi_t = noise_scheduler.add_noise(zi, noise, timesteps_add[start_step])  
# mask = mask.unsqueeze(1)
m = F.interpolate(mask.to(zi.dtype), size=(zi.shape[2], zi.shape[3]), 
                    mode='bilinear', align_corners=False)

input_ids = dataset["prompt_ids"].to(device)
input_ids = input_ids.unsqueeze(0)
encoder_hidden_states = text_encoder(input_ids, return_dict=False)[0]

timesteps = noise_scheduler.timesteps
iterable = tqdm(
    enumerate(timesteps),
    total=len(timesteps),
    leave=False,
    desc=" " * 4 + "Diffusion denoising",
)
# iterable = enumerate(timesteps)
start_step = 1
# -----------------------denoise------------------------
for i, t in iterable:
    if i >= start_step:
        unet_input = torch.cat([zi_t, zi_m, zd, m], dim=1)      
        with torch.no_grad():
            noise_pred = unet(unet_input, t, 
                                encoder_hidden_states)[0]
        zi_t = noise_scheduler.step(noise_pred, t, zi_t).prev_sample

# torch.cuda.empty_cache()
decode_rgb = vae.decode(zi_t / vae.config.scaling_factor)
decode_rgb = decode_rgb['sample'].squeeze()
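
One thing that may be worth checking in the snippet above: the latent is noised at timesteps_add[start_step] with start_step = 5, but start_step is then reset to 1 before the loop. If timesteps_add matches noise_scheduler.timesteps (get_timesteps is not shown, so this is an assumption), denoising would begin at a timestep that expects more noise than was actually added. Below is a minimal sketch of driving both from one index. split_schedule is a hypothetical helper, and the timestep list is only a stand-in for the scheduler's.

```python
def split_schedule(timesteps, start_index):
    """Return (noising_timestep, remaining_timesteps) derived from one start index."""
    if not 0 <= start_index < len(timesteps):
        raise ValueError("start_index must point inside the schedule")
    return timesteps[start_index], timesteps[start_index:]


# Stand-in for noise_scheduler.timesteps with 50 inference steps (descending order).
timesteps = list(range(980, -1, -20))

t_noise, remaining = split_schedule(timesteps, start_index=5)

# t_noise is the timestep that would be passed to add_noise(zi, noise, t_noise);
# the denoising loop would then iterate over `remaining` only, so the first
# denoising step sees exactly the noise level that was added.
print(t_noise, remaining[0], len(remaining))
```

With this layout, changing the start index in one place changes both the noising level and the first denoising step together.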

The results for start_step values of 0, 5, and 15, respectively, are as follows:

Another weird thing is that the range of decode_rgb is about [-2, 2]. Shouldn't it be [-1, 1]?
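
On the range question: as far as I know, the VAE decoder has no output activation that bounds its values, so some overshoot beyond [-1, 1] is possible, and post-processing usually rescales and clips the result before display. A minimal sketch of that step plus a rough check of how much of the output is out of range. Both function names are hypothetical, and the tensor is a stand-in for decode_rgb.

```python
import torch


def to_display_range(decoded: torch.Tensor) -> torch.Tensor:
    """Map a decoded image from nominal [-1, 1] to [0, 1], clipping any overshoot."""
    return (decoded / 2 + 0.5).clamp(0, 1)


def fraction_out_of_range(decoded: torch.Tensor) -> float:
    """Share of values outside [-1, 1]; a large share may hint at off-distribution latents."""
    return (decoded.abs() > 1).float().mean().item()


decoded = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])  # stand-in for decode_rgb
image = to_display_range(decoded)
print(image, fraction_out_of_range(decoded))
```

A small fraction of clipped values is usually harmless. A large fraction would suggest looking at the latents being decoded rather than at the decoder itself.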
Currently, I think the problem may lie in either the inference code or the size of the dataset (about 5,000 image sets so far). Can someone guide me on how to determine which part is causing the problem?
Any suggestions and ideas will be greatly appreciated!
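
One possible way to rule the VAE in or out is to measure a plain encode/decode round trip with no diffusion involved. The sketch below is only an illustration: roundtrip_psnr is a hypothetical helper, and the encode/decode callables are toy stand-ins. With a real VAE they could be something like lambda x: vae.encode(x).latent_dist.mode() and lambda z: vae.decode(z).sample.

```python
import math

import torch
import torch.nn.functional as F


def roundtrip_psnr(encode, decode, image: torch.Tensor) -> float:
    """PSNR in dB between an image in [-1, 1] and its encode/decode reconstruction."""
    with torch.no_grad():
        recon = decode(encode(image)).clamp(-1, 1)
    mse = torch.mean((recon - image) ** 2).item()
    if mse == 0:
        return float("inf")
    # The value range [-1, 1] spans 2, so the squared peak is 4.
    return 10 * math.log10(4.0 / mse)


torch.manual_seed(0)
image = torch.rand(1, 3, 16, 16) * 2 - 1  # stand-in image in [-1, 1]

# Toy lossy "autoencoder": 2x average pooling, then nearest-neighbour upsampling.
psnr_lossy = roundtrip_psnr(
    lambda x: F.avg_pool2d(x, 2),
    lambda z: F.interpolate(z, scale_factor=2, mode="nearest"),
    image,
)
# Toy lossless pair, which should report infinite PSNR.
psnr_exact = roundtrip_psnr(lambda x: x * 0.5, lambda z: z * 2.0, image)
print(psnr_lossy, psnr_exact)
```

If the round trip on real training images already looks blurry, the diffusion code is probably not the main cause. If it looks sharp, attention can move to the UNet inputs and the sampling loop.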

Marquess98 · 2024-11-22T03:23:48Z

Author

PS: the ground-truth (GT) image looks like the one below.

I have tried both SDXL and SD v1.4 as the pretrained model, and the results are similarly poor in both cases.


sayakpaul · 2024-11-23T13:37:34Z

Maintainer

This seems more like a discussion than an issue, so I am transferring it to "Discussions".


Marquess98 Nov 28, 2024
Author

Can you give me some advice on this problem? I am trying to scale up the dataset. Thank you so much.

Marquess98 · 2024-12-06T07:36:19Z

Author

Update:
I changed the line unet_input = torch.cat([zi_t, zi_m, zd, m], dim=1) to unet_input = torch.cat([zi_t, zd, m, zi_m], dim=1). The model finally learned to generate content in the masked areas, but the entire image is still blurry, and the model does not seem to pick up the correct information from the input image zd. Below are some inference results with different strength values (0.1, 0.3, 0.7, and 1.0):


Can someone give me some insight? Thank you!
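
Since the weights of the UNet's first convolution are tied to channel positions, the concatenation order at inference has to match the order used during training. A minimal sketch of keeping both in sync through one shared helper. build_unet_input and CHANNEL_ORDER are hypothetical names, and the 4 + 4 + 1 + 4 channel split is an assumption based on 4-channel VAE latents and a 1-channel mask.

```python
import torch

# Single source of truth for the channel layout, imported by both the
# training script and the inference script (assumed layout, see above).
CHANNEL_ORDER = ("zi_t", "zd", "m", "zi_m")


def build_unet_input(zi_t, zd, m, zi_m):
    """Concatenate the UNet inputs along the channel axis in CHANNEL_ORDER."""
    parts = {"zi_t": zi_t, "zd": zd, "m": m, "zi_m": zi_m}
    return torch.cat([parts[name] for name in CHANNEL_ORDER], dim=1)


b, h, w = 2, 8, 8
x = build_unet_input(
    zi_t=torch.randn(b, 4, h, w),  # noisy target latent
    zd=torch.randn(b, 4, h, w),    # latent of the second modality
    m=torch.rand(b, 1, h, w),      # mask resized to latent resolution
    zi_m=torch.randn(b, 4, h, w),  # latent of the masked image
)
print(x.shape)  # the channel count must equal the UNet's in_channels
```

Comparing x.shape[1] against the in_channels value of the trained UNet is a cheap way to catch a layout mismatch before sampling.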


Marquess98 Dec 10, 2024
Author

What confuses me most now is this: when the strength is high, the image becomes clearer but is completely unrelated to the input image; when the strength is low, the structure of the input image is visible but the overall image is very blurry.
Is this a problem with the VAE encoder/decoder, or is it caused by the additional input zd?
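
Part of this behaviour may simply follow from how an img2img-style strength is usually defined: it sets both how many denoising steps run and how much of the input latent survives the initial noising. The sketch below prints those quantities for the four strengths mentioned above. It assumes 1000 training timesteps, a scaled_linear beta schedule with beta_start=0.00085 and beta_end=0.012, 50 evenly spaced inference steps, and no step offset. None of these values come from the post, and both function names are hypothetical.

```python
import math


def alphas_cumprod(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a 'scaled_linear' beta schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    out, prod = [], 1.0
    for i in range(num_train_timesteps):
        beta = (lo + (hi - lo) * i / (num_train_timesteps - 1)) ** 2
        prod *= 1.0 - beta
        out.append(prod)
    return out


def start_for_strength(strength, num_inference_steps, num_train_timesteps=1000):
    """Return (steps_to_run, start_timestep) for an img2img-style strength."""
    stride = num_train_timesteps // num_inference_steps
    timesteps = [i * stride for i in range(num_inference_steps)][::-1]  # descending
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    if steps_to_run == 0:
        return 0, None  # nothing to denoise
    return steps_to_run, timesteps[num_inference_steps - steps_to_run]


acp = alphas_cumprod()
for s in (0.1, 0.3, 0.7, 1.0):
    steps, t = start_for_strength(s, num_inference_steps=50)
    signal = math.sqrt(acp[t])       # weight of the clean latent after noising
    noise = math.sqrt(1.0 - acp[t])  # weight of the Gaussian noise
    print(f"strength={s}: steps={steps}, t={t}, signal={signal:.3f}, noise={noise:.3f}")
```

At strength 1.0 the starting latent is almost pure noise, so any link to the input has to come through the conditioning channels (zd, m, zi_m). At low strength only a few steps run from a lightly noised latent, so the output stays close to whatever that latent already encodes.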
