Experiment: Prompt enhancer #7321
asomoza started this conversation in Show and tell
-
Here's the full code:

```py
import torch
from transformers import GenerationConfig, GPT2LMHeadModel, GPT2Tokenizer, LogitsProcessor, LogitsProcessorList
from diffusers import StableDiffusionXLPipeline

# SAI-style templates the prompt gets wrapped in
styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed",
    "comic": "comic of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
    "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
    "pixelart": "pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
}

# the only words the GPT-2 model is allowed to generate
words = [
    "aesthetic",
    "astonishing",
    "beautiful",
    "breathtaking",
    "composition",
    "contrasted",
    "epic",
    "moody",
    "enhanced",
    "exceptional",
    "fascinating",
    "flawless",
    "glamorous",
    "glorious",
    "illumination",
    "impressive",
    "improved",
    "inspirational",
    "magnificent",
    "majestic",
    "hyperrealistic",
    "smooth",
    "sharp",
    "focus",
    "stunning",
    "detailed",
    "intricate",
    "dramatic",
    "high",
    "quality",
    "perfect",
    "light",
    "ultra",
    "highly",
    "radiant",
    "satisfying",
    "soothing",
    "sophisticated",
    "stylish",
    "sublime",
    "terrific",
    "touching",
    "timeless",
    "wonderful",
    "unbelievable",
    "elegant",
    "awesome",
    "amazing",
    "dynamic",
    "trendy",
]

word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]


def find_and_order_pairs(s, pairs):
    # keep the known word pairs first, then drop leftover words that belong to a pair but appeared alone
    words = s.split()
    found_pairs = []
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])
    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s


class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        # after a token is generated, push its bias far down so it isn't repeated
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias


tokenizer = GPT2Tokenizer.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion")
model = GPT2LMHeadModel.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion", torch_dtype=torch.float16).to(
    "cuda"
)
model.eval()

# every token starts at -inf; only the tokens of the allowed words get a 0 bias
word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])

prompt = "a road beside a river with trees and a village, studio ghibli style"
style = "anime"
prompt = styles[style].format(prompt=prompt)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
max_new_tokens = 40 - token_count  # the token budget includes the original prompt

generation_config = GenerationConfig(
    penalty_alpha=0.7,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,
    do_sample=True,
)

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )

output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
# split the decoded text back into the original prompt and the generated words
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part

pipe = StableDiffusionXLPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipe.set_adapters(["offset"], adapter_weights=[0.2])

image = pipe(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    num_inference_steps=25,
).images[0]
image.save("image.png")
```
-
Holy mother of universe, galaxy, and all the other sentient beings! What is this! Wow! And thank you for sharing your knowledge with this depth!
-
cc @stevhliu
-
I've read people say that some other apps are just better because they generate good images out of the box, so what I'll try to do here is show that diffusers can do the same.
To keep the tests fair, I'll use the Juggernaut-XL-v9 model, which I think is what Fooocus uses by default.
I'll start with a base image of a cat: one from diffusers as-is and one from Fooocus, to compare them.
Most people would say the Fooocus image is better. To get this kind of image, Fooocus does a couple of things under the hood.
I don't like to just copy, so I'll use my own take on how to do this, but it will be similar to what Fooocus does.
For the GPT-2 model, I'm going to use this one from Gustavosta (`Gustavosta/MagicPrompt-Stable-Diffusion`), which was trained on SD prompts.
If we just use the model as-is, it alters the prompt. For example, if we give it the prompt `a cat`, it will turn it into something like `photo of a cat dressed as Gandalf in the Lord of the Rings in the Shire, fantasy, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by Tony Sart and artgerm and randy vargas`. This is not what we want, because it changes the subject and the style. The trick is to make it use only the words we want, so I'll use my own short list of words like this:
Fooocus uses more than 600 words for this; you can grab them here.
To be able to do this, we need a custom `LogitsProcessor` and a logit bias. What I'm doing here is making the model "biased" towards the words in the list by giving them a bias of 0 while giving every other token a large negative value, so it never picks them during generation; also, after a word is used it gets a negative bias so it isn't used again.
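Here's the relevant part of the full script above:

```py
class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        # after a token is generated, push its bias far down so it isn't repeated
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias


# every token starts at -inf; only the tokens of the allowed words get a 0 bias
word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])
```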
Now if I give the model a prompt it will complete it with the words inside the list:
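This is the generation step from the full script (`generation_config` and `processor_list` are defined there):

```py
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
max_new_tokens = 40 - token_count  # the token budget includes the original prompt

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )
```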
I'm limiting the generation to 40 tokens, with the original prompt included. For me this is enough, but you can play with the number, especially if you allow more than 75 tokens in the prompt. Note that these GPT-2 tokens aren't necessarily the same as the SD ones.
But this is not enough yet: at this point the generated part is just a loose run of words from the list, which we can see by splitting the output.
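From the full script above:

```py
output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
# split the decoded text back into the original prompt and the generated words
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
```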
So I want to do something extra: pair the words up and remove the ones left without a pair. For this I use these pairs and a function:
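```py
word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]


def find_and_order_pairs(s, pairs):
    # keep the known word pairs first, then drop leftover words that belong to a pair but appeared alone
    words = s.split()
    found_pairs = []
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])
    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s


pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part
```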
Now we have a prompt that looks a lot more like what Fooocus uses, and these are some of the images generated with it:
I still want a little more, so I'll add a couple of SAI styles and load the offset lora:
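Both pieces are in the full script above: the style templates wrap the prompt before generation, and the offset lora is loaded onto the SDXL pipeline.

```py
styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    # "photographic", "comic", "lineart" and "pixelart" are in the full script above
}
prompt = styles["anime"].format(prompt=prompt)

pipe.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipe.set_adapters(["offset"], adapter_weights=[0.2])
```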
Finally, we have a really good prompt enhancer. These are some tests:
Prompt: a cat
So the result is that we have an image generation script that enhances the prompt even if we give it a one-word prompt like dog, cat, woman, etc., and IMO it compares in quality to what other apps generate without much effort put into the prompt.
Using GPT-2 is a good trade-off because it's a lightweight and fast model, but it's only a little better than adding random words to the prompt. The best solution would be to train a bigger and better LLM to do this, but some people don't have that much VRAM to spare and wouldn't be able to use it.
This can still be improved with more styles, more steps, and maybe CFG rescaling; we could also load an enhancer lora that adds more details and do a little color grading. This would be good as a community pipeline to use for quick, good-quality generations.
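As a sketch of the CFG rescaling idea: the SDXL pipeline already exposes a `guidance_rescale` argument, so it could be as simple as changing the final call like this (the 0.7 value is an assumption, not something tuned here):

```py
image = pipe(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    guidance_rescale=0.7,  # assumption: an example value, not tested for this setup
    num_inference_steps=25,
).images[0]
```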