
How to improve so we could get results closer to the "regular" VQGAN+CLIP? #14

Open
apolinario opened this issue Sep 6, 2021 · 2 comments

Comments

@apolinario

Hi! I really love this idea and think this concept solves the main bottleneck of the current VQGAN+CLIP approach, which is the optimisation run for each prompt. I love how instantaneous this approach is at generating new images. However, results with the different CC12M or blog-captions models fall short in comparison to the most recent VQGAN+CLIP optimisation approaches.
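
For context, here is a minimal sketch of the per-prompt optimisation loop that makes "regular" VQGAN+CLIP slow, contrasted with the single feed-forward pass this project uses. This is an illustrative sketch, not this repo's code: `load_vqgan()` is a hypothetical helper for loading a taming-transformers VQGAN, it assumes OpenAI's `clip` package, and CLIP's usual input normalisation plus the z-quantisation step are omitted for brevity.

```python
# Minimal sketch (assumption, not this repo's code) of the per-prompt
# optimisation loop behind "regular" VQGAN+CLIP. `load_vqgan()` is a
# hypothetical helper returning a taming-transformers VQModel; CLIP's input
# normalisation and the z-quantisation step are omitted for brevity.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
vqgan = load_vqgan().to(device)  # hypothetical: load a pretrained VQGAN decoder

text = clip.tokenize(["a watercolor painting of a fox"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(text).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# One fresh latent per prompt, optimised for hundreds of steps -> the bottleneck.
z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

for step in range(300):
    image = vqgan.decode(z)                                  # (1, 3, 256, 256) in [-1, 1]
    image = F.interpolate(image, size=224, mode="bilinear")  # CLIP input size
    img_emb = clip_model.encode_image((image + 1) / 2).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = (1 - (img_emb * text_emb).sum(-1)).mean()         # cosine distance to the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()

# A feed-forward model instead predicts z from text_emb in a single pass:
#     z = generator(text_emb); image = vqgan.decode(z)
```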

I am wondering where it could potentially be improved. One thing could be trying to incorporate the MSE-regularised and z+quantize variants of the most recent VQGAN+CLIP approaches. The other is that I am wondering whether a bigger training dataset would improve the quality. Would it make sense to train it on ImageNet captions, or maybe even a bigger 100M+ caption dataset? (maybe C@H?)

As you can see, I can't actually contribute much (though I could help with a bigger-dataset training effort), but I'm cheering for this project not to die!

@apolinario apolinario changed the title Getting results closer to the "regular" VQGAN+CLIP How to improve so we could get results closer to the "regular" VQGAN+CLIP? Sep 6, 2021
@mehdidc
Owner

mehdidc commented Sep 13, 2021

Hi @apolinario, thanks for your interest! Indeed, the quality does not match the optimisation approaches yet; the problem could come from the model architecture that is used and/or the loss function.
There is an issue by @afiaka87, #8 "Positional Stickiness", which mentions one of the problems that seems to be persistent (it appears to happen regardless of model size or data size), and we are still not certain why it happens.

"I guess one thing could be trying to embed the MSE regularised and z+quantize most recent VQGAN+CLIP approaches."

Could you please give more details about this approach, or a reference? I could try it out.

"ImageNet captions" I wasn't aware there are captions for ImageNet, do you have a link or repo?

Thanks

@apolinario
Author

Hi @mehdidc, thanks for getting back on this.
So this is a "MSE regularised and Z+quantize VQGAN-CLIP" notebook, there's a debate of whether or not this actually improves quality but it seems to be preferred and widely adopted by some of the digital artists and the EAI community

And yeah, actually "ImageNet captions" don't indeed exist, I just had the naive thought of trying to train it in similar captions of the dataset VQGAN itself was trained without putting more thought into. However, with the release of the first big dataset output from the crawl@home project, I think the LAION-400M or a subset of it could suit very well for training

And thanks for letting me know about the persistent #8 "Positional Stickiness" issue. I noticed similar behavior while using the model. I will try to look into it and bring some attention to it as well.
