Hi! I really love this idea and think that this concept solves the main bottleneck of the current VQGAN+CLIP approach, which is the per-prompt optimisation. I love how instantaneous this approach is at generating new images. However, the results from the different CC12M or blog-captions models fall short compared to the most recent VQGAN+CLIP optimisation approaches.
I am wondering where it could potentially be improved. One idea could be to incorporate the MSE-regularised and z+quantize variants of the most recent VQGAN+CLIP approaches. The other is whether a bigger training dataset would improve the quality: would it make sense to train on ImageNet captions, or maybe an even bigger 100M+ caption dataset (maybe C@H)?
As you can see, I can't actually contribute much (though I could help with a bigger dataset training effort), but I'm cheering for this project to not die!
apolinario changed the title from "Getting results closer to the 'regular' VQGAN+CLIP" to "How to improve so we could get results closer to the 'regular' VQGAN+CLIP?" on Sep 6, 2021
Hi @apolinario, thanks for your interest! Indeed, the quality does not match the optimization approaches yet; the problem could come from the model architecture that is used and/or the loss function.
There is an issue by @afiaka87, #8 "Positional Stickiness", which mentions one of the problems that seems to be persistent (it appears to happen regardless of model size or dataset size), and we are still not certain why it happens.
"I guess one thing could be trying to embed the MSE regularised and z+quantize most recent VQGAN+CLIP approaches."
Could you please give more details about this approach or a reference ? I could try it out.
"ImageNet captions" I wasn't aware there are captions for ImageNet, do you have a link or repo?
Hi @mehdidc, thanks for getting back on this.
So, this is a "MSE regularised and Z+quantize VQGAN-CLIP" notebook. There's a debate about whether or not it actually improves quality, but it seems to be preferred and widely adopted by some of the digital artists and the EAI community.
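For reference, here is a minimal sketch of what that objective roughly looks like, assuming a PyTorch setup where a latent `z` is optimised against a CLIP loss plus an MSE term pulling it back toward its initial value, and where decoding goes through quantised codes with a straight-through estimator. The names `vqgan`, `clip_model`, `text_embed`, and `preprocess_for_clip` are placeholders for the pretrained models and helpers used in the notebook, not its actual code:

```python
import torch
import torch.nn.functional as F


def z_quantize(z, codebook):
    """Straight-through quantisation of latents z of shape (B, H, W, C)
    against a codebook of shape (K, C): snap each latent vector to its
    nearest codebook entry, while letting gradients flow to the continuous z."""
    flat = z.reshape(-1, z.shape[-1])                # (B*H*W, C)
    dists = torch.cdist(flat, codebook)              # distances to all K entries
    z_q = codebook[dists.argmin(dim=-1)].view_as(z)  # nearest-neighbour codes
    return z + (z_q - z).detach()                    # straight-through estimator


def mse_regularised_loss(z, z_init, codebook, text_embed,
                         vqgan, clip_model, preprocess_for_clip,
                         mse_weight=0.5):
    """CLIP loss on the decoded, quantised latents plus an MSE penalty
    that keeps z close to its initial value (the 'MSE regularisation')."""
    image = vqgan.decode(z_quantize(z, codebook))     # decode through quantised codes
    image_embed = clip_model.encode_image(preprocess_for_clip(image))
    clip_loss = 1 - F.cosine_similarity(image_embed, text_embed, dim=-1).mean()
    mse_loss = F.mse_loss(z, z_init)                  # pull z back toward its start
    return clip_loss + mse_weight * mse_loss
```

As far as I understand, the optimisation notebooks also decay the MSE weight over the iterations, so the regulariser mostly shapes the early steps; in a feed-forward setting the analogous term would presumably act on the predicted latents instead of an optimised `z`.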
And yeah, "ImageNet captions" don't actually exist; I just had the naive thought of training it on captions similar to the dataset VQGAN itself was trained on, without putting more thought into it. However, with the release of the first big dataset from the crawl@home project, I think LAION-400M, or a subset of it, could suit the training very well.
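If that direction is pursued, here is a minimal sketch of how (image, caption) pairs from a LAION-400M subset could be streamed with the webdataset library, assuming the data has already been packed into webdataset-style .tar shards. The shard pattern, image size, and batch size below are placeholders, not the project's actual training pipeline:

```python
import webdataset as wds
from torchvision import transforms

# Placeholder shard pattern: LAION-400M is distributed as webdataset .tar
# shards of (image, caption) pairs; point this at wherever the subset lives.
shards = "laion400m-subset/{00000..00099}.tar"

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

dataset = (
    wds.WebDataset(shards)
    .decode("pil")                      # decode images to PIL
    .to_tuple("jpg;png", "txt")         # (image, caption) pairs
    .map_tuple(transform, lambda t: t)  # transform the image, keep the caption as text
)

loader = wds.WebLoader(dataset, batch_size=64, num_workers=4)

for images, captions in loader:
    ...  # training step goes here
```

Streaming shards this way avoids materialising all 400M pairs at once, and a subset can be selected simply by listing fewer shards.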
And thanks for letting me know about the persistent #8 "Positional Stickiness" issue. I noticed similar behavior while using the model. I will try to look into it and bring some attention to it as well.