Experiment: Prompt enhancer #7321
asomoza started this conversation in Show and tell
-
Here's the full code:

```py
import torch
from transformers import GenerationConfig, GPT2LMHeadModel, GPT2Tokenizer, LogitsProcessor, LogitsProcessorList
from diffusers import StableDiffusionXLPipeline

# SAI-style templates the prompt gets wrapped in
styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed",
    "comic": "comic of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
    "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
    "pixelart": "pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
}

# the only words the GPT-2 model is allowed to generate
words = [
    "aesthetic",
    "astonishing",
    "beautiful",
    "breathtaking",
    "composition",
    "contrasted",
    "epic",
    "moody",
    "enhanced",
    "exceptional",
    "fascinating",
    "flawless",
    "glamorous",
    "glorious",
    "illumination",
    "impressive",
    "improved",
    "inspirational",
    "magnificent",
    "majestic",
    "hyperrealistic",
    "smooth",
    "sharp",
    "focus",
    "stunning",
    "detailed",
    "intricate",
    "dramatic",
    "high",
    "quality",
    "perfect",
    "light",
    "ultra",
    "highly",
    "radiant",
    "satisfying",
    "soothing",
    "sophisticated",
    "stylish",
    "sublime",
    "terrific",
    "touching",
    "timeless",
    "wonderful",
    "unbelievable",
    "elegant",
    "awesome",
    "amazing",
    "dynamic",
    "trendy",
]

word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]


def find_and_order_pairs(s, pairs):
    # keep the known word pairs first, then drop leftover words that belong to a pair but appeared alone
    words = s.split()
    found_pairs = []
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])
    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s


class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        # after a token is generated, push its bias far down so it isn't repeated
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias


tokenizer = GPT2Tokenizer.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion")
model = GPT2LMHeadModel.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion", torch_dtype=torch.float16).to(
    "cuda"
)
model.eval()

# every token starts at -inf; only the tokens of the allowed words get a 0 bias
word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])

prompt = "a road beside a river with trees and a village, studio ghibli style"
style = "anime"
prompt = styles[style].format(prompt=prompt)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
max_new_tokens = 40 - token_count  # the token budget includes the original prompt

generation_config = GenerationConfig(
    penalty_alpha=0.7,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,
    do_sample=True,
)

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )

output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
# split the decoded text back into the original prompt and the generated words
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part

pipe = StableDiffusionXLPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipe.set_adapters(["offset"], adapter_weights=[0.2])

image = pipe(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    num_inference_steps=25,
).images[0]
image.save("image.png")
```
-
Holy mother of universe, galaxy, and all the other sentient beings! What is this! Wow! And thank you for sharing your knowledge with this depth!
-
cc @stevhliu
-
I've read people say that some other apps are just better because they generate good images out of the box, so what I'll try to do here is show that diffusers can do the same.
To keep the tests fair, I'll use the Juggernaut-XL-v9 model, which I think is what Fooocus uses by default.
I'll start with a base image of a cat: one from diffusers as-is and one from Fooocus, to compare them.
Most people would say the Fooocus image is better. To get this kind of image, Fooocus does a couple of things under the hood.
I don't like to just copy, so I'll use my own take on how to do this, but it will be similar to what Fooocus does.
For the GPT-2 model, I'm going to use this one from Gustavosta (`Gustavosta/MagicPrompt-Stable-Diffusion`), which was trained on SD prompts.
If we just use the model as-is, it alters the prompt. For example, if we give it the prompt `a cat`, it will turn it into something like `photo of a cat dressed as Gandalf in the Lord of the Rings in the Shire, fantasy, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by Tony Sart and artgerm and randy vargas`. This is not what we want, because it changes the subject and the style. The trick is to make it use only the words we want, so I'll use my own short list of words like this:
Fooocus uses more than 600 words for this; you can grab them here.
To be able to do this, we need a custom `LogitsProcessor` and a logit bias. What I'm doing here is making the model "biased" towards the words in the list by giving them a bias of 0 while giving every other token a large negative value, so it never picks them during generation; also, after a word is used it gets a negative bias so it isn't used again.
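Here's the relevant part of the full script above:

```py
class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        # after a token is generated, push its bias far down so it isn't repeated
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias


# every token starts at -inf; only the tokens of the allowed words get a 0 bias
word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])
```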
Now if I give the model a prompt it will complete it with the words inside the list:
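This is the generation step from the full script (`generation_config` and `processor_list` are defined there):

```py
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
max_new_tokens = 40 - token_count  # the token budget includes the original prompt

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )
```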
I'm limiting the generation to 40 tokens, with the original prompt included. For me this is enough, but you can play with the number, especially if you allow more than 75 tokens in the prompt. Note that these GPT-2 tokens aren't necessarily the same as the SD ones.
But this is not enough yet: at this point the generated part is just a loose run of words from the list, which we can see by splitting the output.
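From the full script above:

```py
output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
# split the decoded text back into the original prompt and the generated words
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
```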
So I want to do something extra: pair the words up and remove the ones left without a pair. For this I use these pairs and a function:
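```py
word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]


def find_and_order_pairs(s, pairs):
    # keep the known word pairs first, then drop leftover words that belong to a pair but appeared alone
    words = s.split()
    found_pairs = []
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])
    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s


pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part
```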
Now we have a prompt that looks a lot more like what Fooocus uses, and these are some of the images generated with it:
I still want a little more, so I'll add a couple of SAI styles and load the offset lora:
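Both pieces are in the full script above: the style templates wrap the prompt before generation, and the offset lora is loaded onto the SDXL pipeline.

```py
styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    # "photographic", "comic", "lineart" and "pixelart" are in the full script above
}
prompt = styles["anime"].format(prompt=prompt)

pipe.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipe.set_adapters(["offset"], adapter_weights=[0.2])
```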
Finally, we have a really good prompt enhancer. These are some tests:
Prompt: a cat
So the result is that we have an image generation script that enhances the prompt even if we give it a one-word prompt like dog, cat, woman, etc., and IMO it compares in quality to what other apps generate without much effort put into the prompt.
Using GPT-2 is a good trade-off because it's a lightweight and fast model, but it's only a little better than adding random words to the prompt. The best solution would be to train a bigger and better LLM to do this, but some people don't have that much VRAM to spare and wouldn't be able to use it.
This can still be improved with more styles, more steps, and maybe CFG rescaling; we could also load an enhancer lora that adds more details and do a little color grading. This would be good as a community pipeline to use for quick, good-quality generations.
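As a sketch of the CFG rescaling idea: the SDXL pipeline already exposes a `guidance_rescale` argument, so it could be as simple as changing the final call like this (the 0.7 value is an assumption, not something tuned here):

```py
image = pipe(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    guidance_rescale=0.7,  # assumption: an example value, not tested for this setup
    num_inference_steps=25,
).images[0]
```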