17 Jan 16:48

patrickvonplaten

1909952

Patch release

Make sure diffusers can correctly be used in offline mode again: #1767 (comment)

Respect offline mode when loading pipeline by @Wauplin in #6456
Fix offline mode import by @Wauplin in #6467
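
As a rough illustration of what offline mode means in practice, the sketch below sets the HF_HUB_OFFLINE environment variable before any Hugging Face library is imported. The helper name enable_offline_mode is hypothetical, and the sketch assumes the checkpoint files are already in the local cache.

```python
import os


def enable_offline_mode() -> None:
    """Hypothetical helper: ask Hugging Face libraries not to use the network.

    Call it before importing diffusers / huggingface_hub, since the variable
    is typically read when those libraries are imported.
    """
    os.environ["HF_HUB_OFFLINE"] = "1"


enable_offline_mode()

# With the variable set, a later call such as
#   DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# is expected to resolve from the local cache only (assuming the files were
# downloaded earlier) instead of contacting the Hub.
print(os.environ["HF_HUB_OFFLINE"])
```

Setting the variable in the shell before starting Python has the same effect and avoids any import-order concerns.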

Contributors

Wauplin


27 Dec 13:49

sayakpaul

v0.25.0

7f551e2

v0.25.0: aMUSEd, faster SDXL, interruptible pipelines

aMUSEd

aMUSEd is a lightweight text-to-image model based on the MUSE architecture. aMUSEd is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once. aMUSEd is currently a research release.

aMUSEd is a VQVAE token-based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with MUSE, it uses the smaller text encoder CLIP-L/14 instead of T5-XXL. Due to its small parameter count and few-forward-pass generation process, aMUSEd can generate many images quickly. This benefit is seen particularly at larger batch sizes.

Text-to-image generation

import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained(
    "amused/amused-512", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "cowboy"
image = pipe(prompt, generator=torch.manual_seed(8)).images[0]
image.save("text2image_512.png")

Image-to-image generation

import torch
from diffusers import AmusedImg2ImgPipeline
from diffusers.utils import load_image

pipe = AmusedImg2ImgPipeline.from_pretrained(
    "amused/amused-512", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "apple watercolor"
input_image = (
    load_image(
        "https://huggingface.co/amused/amused-512/resolve/main/assets/image2image_256_orig.png"
    )
    .resize((512, 512))
    .convert("RGB")
)

image = pipe(prompt, input_image, strength=0.7, generator=torch.manual_seed(3)).images[0]
image.save("image2image_512.png")

Inpainting

import torch
from diffusers import AmusedInpaintPipeline
from diffusers.utils import load_image
from PIL import Image

pipe = AmusedInpaintPipeline.from_pretrained(
    "amused/amused-512", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a man with glasses"
input_image = (
    load_image(
        "https://huggingface.co/amused/amused-512/resolve/main/assets/inpainting_256_orig.png"
    )
    .resize((512, 512))
    .convert("RGB")
)
mask = (
    load_image(
        "https://huggingface.co/amused/amused-512/resolve/main/assets/inpainting_256_mask.png"
    )
    .resize((512, 512))
    .convert("L")
)

image = pipe(prompt, input_image, mask, generator=torch.manual_seed(3)).images[0]
image.save("inpainting_512.png")

📜 Docs: https://huggingface.co/docs/diffusers/main/en/api/pipelines/amused

🛠️ Models:

amused-256: https://huggingface.co/amused/amused-256 (603M params)
amused-512: https://huggingface.co/amused/amused-512 (608M params)

Faster SDXL

We’re excited to present an array of optimization techniques that can be used to reduce the inference latency of text-to-image diffusion models. All of these can be done in native PyTorch without requiring additional C++ code.

These techniques are not specific to Stable Diffusion XL (SDXL) and can be used to improve other text-to-image diffusion models too. Starting from default fp32 precision, we can achieve a 3x speed improvement by applying different PyTorch optimization techniques. We encourage you to check out the detailed docs provided below.

Note: Compared to the default way most people use Diffusers, which is fp16 + SDPA, applying all the optimizations explained in the blog post below yields a 30% speed-up.

📜 Docs: https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion
🌠 PyTorch blog post: https://pytorch.org/blog/accelerating-generative-ai-3/

Interruptible pipelines

Interrupting the diffusion process is particularly useful when building UIs that work with Diffusers because it allows users to stop the generation process if they're unhappy with the intermediate results. You can incorporate this into your pipeline with a callback.

This callback function should take the following arguments: pipe, i, t, and callback_kwargs (this must be returned). Set the pipeline's _interrupt attribute to True to stop the diffusion process after a certain number of steps. You are also free to implement your own custom stopping logic inside the callback.

In this example, the diffusion process is stopped after 10 steps even though num_inference_steps is set to 50.

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.enable_model_cpu_offload()
num_inference_steps = 50

def interrupt_callback(pipe, i, t, callback_kwargs):
    stop_idx = 10
    if i == stop_idx:
        pipe._interrupt = True

    return callback_kwargs

pipe(
    "A photo of a cat",
    num_inference_steps=num_inference_steps,
    callback_on_step_end=interrupt_callback,
)
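
To make the control flow easier to follow, here is a minimal sketch of how such an interrupt flag can short-circuit a denoising loop. ToyPipeline is a hypothetical stand-in written for this note, not the Diffusers implementation; it only assumes that the loop checks the flag at the start of every step and calls the callback at the end of every step.

```python
class ToyPipeline:
    """Toy stand-in for a pipeline; not the Diffusers implementation."""

    def __init__(self):
        self._interrupt = False

    def __call__(self, num_inference_steps, callback_on_step_end=None):
        self._interrupt = False
        completed = []
        for i in range(num_inference_steps):
            if self._interrupt:
                continue  # skip the remaining steps once the flag is set
            completed.append(i)  # placeholder for one denoising step
            if callback_on_step_end is not None:
                t = num_inference_steps - 1 - i  # placeholder timestep value
                callback_on_step_end(self, i, t, {})
        return completed


def interrupt_callback(pipe, i, t, callback_kwargs):
    stop_idx = 3
    if i == stop_idx:
        pipe._interrupt = True
    return callback_kwargs


toy = ToyPipeline()
completed = toy(num_inference_steps=8, callback_on_step_end=interrupt_callback)
print(completed)  # [0, 1, 2, 3]
```

In this toy loop the step whose index equals stop_idx still completes, because the callback runs at the end of the step; only the later steps are skipped.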

📜 Docs: https://huggingface.co/docs/diffusers/main/en/using-diffusers/callback

`peft` in our LoRA training examples

We incorporated peft in all the officially supported training examples concerning LoRA. This greatly simplifies the code and improves readability. LoRA training has never been easier, thanks to peft!

More memory-friendly version of LCM LoRA SDXL training

We incorporated best practices from peft to make LCM LoRA training for SDXL more memory-friendly. As such, you don't have to initialize two UNets (teacher and student) anymore. This version also integrates with the datasets library for quick experimentation. Check out this section for more details.

All commits

[docs] Fix video link by @stevhliu in #5986
Fix LLMGroundedDiffusionPipeline super class arguments by @KristianMischke in #5993
Remove a duplicated line? by @sweetcocoa in #6010
[examples/advanced_diffusion_training] bug fixes and improvements for LoRA Dreambooth SDXL advanced training script by @linoytsaban in #5935
[advanced_dreambooth_lora_sdxl_tranining_script] readme fix by @linoytsaban in #6019
[docs] Fix SVD video by @stevhliu in #6004
[Easy] minor edits to setup.py by @sayakpaul in #5996
[From Single File] Allow Text Encoder to be passed by @patrickvonplaten in #6020
[Community Pipeline] Regional Prompting Pipeline by @hako-mikan in #6015
[logging] Fix assertion bug by @StandardAI in #6012
[Docs] Update a link by @StandardAI in #6014
added attention_head_dim, attention_type, resolution_idx by @charchit7 in #6011
fix style by @patrickvonplaten (direct commit on v0.25.0)
[Kandinsky 3.0] Follow-up TODOs by @yiyixuxu in #5944
[schedulers] create self.sigmas during init by @yiyixuxu in #6006
Post Release: v0.24.0 by @patrickvonplaten in #5985
LLMGroundedDiffusionPipeline: inherit from DiffusionPipeline and fix peft by @TonyLianLong in #6023
adapt PixArtAlphaPipeline for pixart-lcm model by @lawrence-cj in #5974
[PixArt Tests] remove fast tests from slow suite by @sayakpaul in #5945
[LoRA serialization] fix: duplicate unet prefix problem. by @sayakpaul in #5991
[advanced dreambooth lora sdxl training script] improve help tags by @linoytsaban in #6035
fix StableDiffusionTensorRT super args error by @gujingit in #6009
Update value_guided_sampling.py by @Parth38 in #6027
Update Tests Fetcher by @DN6 in #5950
Add variant argument to dreambooth lora sdxl advanced by @levi in #6021
[Feature] Support IP-Adapter Plus by @okotaku in #5915
[Community Pipeline] DemoFusion: Democratising High-Resolution Image Generation With No $$$ by @RuoyiDu in #6022
[advanced dreambooth lora training script][bug_fix] change token_abstraction type to str by @linoytsaban in #6040
[docs] Add Kandinsky 3 by @stevhliu in #5988
[docs] #Copied from mechanism by @stevhliu in #6007
Move kandinsky convert script by @DN6 in #6047
Pin Ruff Version by @DN6 in #6059
Ldm unet convert fix by @DN6 in #6038
Fix demofusion by @radames in #6049
[From single file] remove depr warning by @patrickvonplaten in #6043
[advanced_dreambooth_lora_sdxl_tranining_script] save embeddings locally fix by @apolinario in #6058
Device agnostic testing by @arsalanu in #5612
[feat] allow SDXL pipeline to run with fused QKV projections by @sayakpaul in #6030
fix by @DN6 (direct commit on v0.25.0)
Use CC12M for LCM WDS training example by @pcuenca in #5908
Disable Tests Fetcher by @DN6 in #6060
[Advanced Diffusion Training] Cache latents to avoid VAE passes for every training step by @apolinario in #6076
[Euler Discrete] Fix sigma by @patrickvonplaten in #6078
Harmonize HF environment variables + deprecate use_auth_token by @Wauplin in #6066
[docs] SDXL Turbo by @stevhliu in #6065
Add ControlNet-XS support by @UmerHA in #5827
Fix typing inconsistency in Euler discrete scheduler by @iabaldwin in #6052
[PEFT] Adapt example scripts to use PEFT by @younesbelkada in #5388
Fix clearing backend cache from device agnostic testing by @DN6 in #6075
[Community] AnimateDiff + Controlnet Pipeline by @a-r-r-o-w in #5928
EulerDiscreteScheduler add rescale_betas_zero_snr by @Beinsezii in #6024
Add support for IPAdapterFull by @fabiori...

Contributors

kashif, levi, and 51 other contributors


29 Nov 19:21

patrickvonplaten

v0.24.0

76c645d

v0.24.0: IP Adapters, Kandinsky 3.0, Stable Video Diffusion, SDXL Turbo

Stable Video Diffusion, SDXL Turbo, IP Adapters, Kandinsky 3.0

Stable Video Diffusion

Stable Video Diffusion is a powerful image-to-video generation model that can generate high-resolution (576x1024) videos of 2-4 seconds conditioned on an input image.

Image to Video Generation

There are two variants of SVD: SVD and SVD-XT. The SVD checkpoint is trained to generate 14 frames, and the SVD-XT checkpoint is further fine-tuned to generate 25 frames.

You need to condition the generation on an initial image, as follows:

import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
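
As a quick sanity check on clip length, the arithmetic below relates the frame counts quoted above (14 for SVD, 25 for SVD-XT) to the fps=7 used in the export call. The helper clip_seconds is hypothetical and exists only for this illustration.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Duration in seconds of a clip with num_frames frames played at fps."""
    return num_frames / fps


svd_seconds = clip_seconds(14, 7)
svd_xt_seconds = clip_seconds(25, 7)

print(svd_seconds)               # 2.0
print(round(svd_xt_seconds, 2))  # 3.57
```

Both values fall inside the 2-4 second range mentioned for the model; exporting at a different fps changes the duration accordingly.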

Since generating videos is more memory intensive, we can use the decode_chunk_size argument to control how many frames are decoded at once. This will reduce the memory usage. It's recommended to tweak this value based on your GPU memory. Setting decode_chunk_size=1 will decode one frame at a time and will use the least amount of memory, but the video might have some flickering.

Additionally, we also use model cpu offloading to reduce the memory usage.
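
The idea behind decode_chunk_size can be sketched with a toy example: instead of decoding all latent frames in one call, decode them a few at a time and concatenate the results. The names decode_in_chunks and toy_decode are hypothetical, and the "decoder" here just doubles numbers; this is an assumption-laden illustration, not the pipeline's actual decoding code.

```python
def decode_in_chunks(latents, decode_fn, decode_chunk_size):
    """Decode a list of latent frames in slices of decode_chunk_size."""
    frames = []
    for start in range(0, len(latents), decode_chunk_size):
        chunk = latents[start:start + decode_chunk_size]
        frames.extend(decode_fn(chunk))  # only one chunk is decoded at a time
    return frames


chunk_sizes_seen = []


def toy_decode(chunk):
    """Stand-in for a VAE decoder: records the chunk size, doubles each value."""
    chunk_sizes_seen.append(len(chunk))
    return [value * 2 for value in chunk]


latents = list(range(25))  # 25 "frames", as produced by SVD-XT
frames = decode_in_chunks(latents, toy_decode, decode_chunk_size=8)

print(chunk_sizes_seen)  # [8, 8, 8, 1]
print(len(frames))       # 25
```

The decoder never sees more than decode_chunk_size frames at once, which is why a smaller value lowers peak memory at the cost of more decoder calls.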

SDXL Turbo

SDXL Turbo is an adversarial time-distilled Stable Diffusion XL (SDXL) model capable of running inference in as little as 1 step. Also, it does not use classifier-free guidance, further increasing its speed. On a good consumer GPU, you can now generate an image in just 100ms.

Text-to-Image

For text-to-image, pass a text prompt. By default, SDXL Turbo generates a 512x512 image, and that resolution gives the best results. You can try setting the height and width parameters to 768x768 or 1024x1024, but you should expect quality degradations when doing so.

Make sure to set guidance_scale to 0.0 to disable classifier-free guidance, as the model was trained without it. A single inference step is enough to generate high-quality images.
Increasing the number of steps to 2, 3, or 4 should improve image quality.
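
For readers unfamiliar with classifier-free guidance, the sketch below shows the usual way the two noise predictions are combined and why a small guidance_scale lets a pipeline skip the extra unconditional pass. Both helper names are hypothetical, and the "greater than 1" threshold is an assumption about the common convention rather than a statement about any specific pipeline.

```python
def uses_classifier_free_guidance(guidance_scale: float) -> bool:
    """Assumed convention: guidance is only applied for scales above 1."""
    return guidance_scale > 1.0


def guided_prediction(uncond: float, cond: float, guidance_scale: float) -> float:
    """Standard classifier-free guidance mix of two noise predictions."""
    return uncond + guidance_scale * (cond - uncond)


print(uses_classifier_free_guidance(0.0))  # False
print(uses_classifier_free_guidance(7.5))  # True
print(guided_prediction(0.25, 0.75, 1.0))  # 0.75
print(guided_prediction(0.25, 0.75, 7.5))  # 4.0
```

When guidance is skipped, only the text-conditioned prediction is computed, which saves one forward pass per step.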

from diffusers import AutoPipelineForText2Image
import torch

pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
pipeline_text2image = pipeline_text2image.to("cuda")

prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe."

image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
image

Image-to-image

For image-to-image generation, make sure that num_inference_steps * strength is greater than or equal to 1.
The image-to-image pipeline will run for int(num_inference_steps * strength) steps, e.g. int(2 * 0.5) = 1 step in
our example below.
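
The rule above can be written as a small check. The helper img2img_steps is hypothetical and only encodes the int(num_inference_steps * strength) formula from the text, raising an error when the result would be zero steps.

```python
def img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps actually run, per the formula in the text."""
    steps = int(num_inference_steps * strength)
    if steps < 1:
        raise ValueError(
            "num_inference_steps * strength must be at least 1, "
            f"got {num_inference_steps} * {strength}"
        )
    return steps


print(img2img_steps(2, 0.5))  # 1
print(img2img_steps(4, 0.5))  # 2
```

With num_inference_steps=1 and strength=0.5 the product truncates to 0, so the helper raises instead of silently running no steps.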

from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

# use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
init_image = init_image.resize((512, 512))

prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"

image = pipeline(prompt, image=init_image, strength=0.5, guidance_scale=0.0, num_inference_steps=2).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

Image-to-image generation sample using SDXL Turbo

IP Adapters

IP Adapters have been shown to be remarkably powerful at generating images conditioned on other images.

Thanks to @okotaku, we have added IP Adapters to the most important pipelines, allowing you to combine them for a variety of different workflows, e.g. they work with Img2Img, ControlNet, and LCM-LoRA out of the box.

LCM-LoRA

from diffusers import DiffusionPipeline, LCMScheduler
import torch
from diffusers.utils import load_image

model_id = "sd-dreambooth-library/herge-style"
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.load_lora_weights(lcm_lora_id)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "best quality, high quality"
image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
images = pipe(
    prompt=prompt,
    ip_adapter_image=image,
    num_inference_steps=4,
    guidance_scale=1,
).images[0]

ControlNet

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
from diffusers.utils import load_image

controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth"
controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16)

pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipeline.to("cuda")

image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

generator = torch.Generator(device="cpu").manual_seed(33)
images = pipeline(
    prompt="best quality, high quality",
    image=depth_map,
    ip_adapter_image=image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=50,
    generator=generator,
).images
images[0].save("yiyi_test_2_out.png")

[Images: ip_image | condition | output]

For more information:

👉 https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading_adapters#ip-adapter

Kandinsky 3.0

Kandinsky has released the 3rd version, which has much improved text-to-image alignment thanks to using Flan-T5 as the text encoder.

Text-to-Image

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
        
prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]

Image-to-Image

from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
        
prompt = "A painting of the inside of a subway train with tiny raccoons."
image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png")

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]

Check it out:

👉 https://huggingface.co/docs/diffusers/main/en/api/pipelines/kandinsky3#kandinsky-3

All commits

LCM-LoRA docs by @patil-suraj in #5782
[Docs] Update and make improvements by @StandardAI in #5819
[docs] Fix title by @stevhliu in #5831
Improve setup.py and add dependency check by @patrickvonplaten in #5826
[Docs] add: japanese sdxl as a reference by @sayakpaul in #5844
Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in #5790
fix memory consistency decoder test by @williamberman in #5828
[PEFT]...

Contributors

kashif, ivanprado, and 33 other contributors


16 Nov 14:59

patrickvonplaten

v0.23.1

4719b8f

[Patch release] Make sure we install correct PEFT version

Small patch release to make sure the correct PEFT version is installed.

All commits

Improve setup.py and add dependency check by @patrickvonplaten in #5826

Contributors

patrickvonplaten


09 Nov 16:30

sayakpaul

v0.23.0

fbb8b34

v0.23.0: LCM LoRA, SDXL LCM, Consistency Decoder from DALL-E 3

LCM LoRA, LCM SDXL, Consistency Decoder

LCM LoRA

Latent Consistency Models (LCM) made quite the mark in the Stable Diffusion community by enabling ultra-fast inference. LCM author @luosiallen, alongside @patil-suraj and @dg845, managed to extend the LCM support for Stable Diffusion XL (SDXL) and pack everything into a LoRA.

The approach is called LCM LoRA.

Below is an example of using LCM LoRA, taking just 4 inference steps:

from diffusers import DiffusionPipeline, LCMScheduler
import torch

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
lcm_lora_id = "latent-consistency/lcm-lora-sdxl"

pipe = DiffusionPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16).to("cuda")

pipe.load_lora_weights(lcm_lora_id)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

prompt = "close-up photography of old man standing in the rain at night, in a street lit by lamps, leica 35mm summilux"
image = pipe(
    prompt=prompt,
    num_inference_steps=4,
    guidance_scale=1,
).images[0]

You can combine the LoRA with Img2Img, Inpaint, ControlNet, ...

as well as with other LoRAs 🤯

👉 Checkpoints
📜 Docs

If you want to learn more about the approach, please have a look at the following:

Paper
Blog

LCM SDXL

Continuing the work of Latent Consistency Models (LCM), we've applied the approach to SDXL as well and give you SSD-1B and SDXL fine-tuned checkpoints.

from diffusers import DiffusionPipeline, UNet2DConditionModel, LCMScheduler
import torch

unet = UNet2DConditionModel.from_pretrained(
    "latent-consistency/lcm-sdxl",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"

generator = torch.manual_seed(0)
image = pipe(
    prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0
).images[0]

👉 Checkpoints
📜 Docs

Consistency Decoder

OpenAI open-sourced the consistency decoder used in DALL-E 3. It improves the decoding part in the Stable Diffusion v1 family of models.

import torch
from diffusers import StableDiffusionPipeline, ConsistencyDecoderVAE

# Load the consistency decoder in the same dtype as the pipeline
vae = ConsistencyDecoderVAE.from_pretrained("openai/consistency-decoder", torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")

pipe("horse", generator=torch.manual_seed(0)).images

Find the documentation here to learn more.

All commits

[Custom Pipelines] Make sure that community pipelines can use repo revision by @patrickvonplaten in #5659
post release (v0.22.0) by @sayakpaul in #5658
Add Pixart to AUTO_TEXT2IMAGE_PIPELINES_MAPPING by @Beinsezii in #5664
Update custom diffusion attn processor by @DN6 in #5663
Model tests xformers fixes by @DN6 in #5679
Update free model hooks by @DN6 in #5680
Fix Basic Transformer Block by @DN6 in #5683
Explicit torch/flax dependency check by @DN6 in #5673
[PixArt-Alpha] fix mask_feature so that precomputed embeddings work with a batch size > 1 by @sayakpaul in #5677
Make sure DDPM and diffusers can be used without Transformers by @sayakpaul in #5668
[PixArt-Alpha] Support non-square images by @sayakpaul in #5672
Improve LCMScheduler by @dg845 in #5681
[Docs] Fix typos, improve, update at Using Diffusers' Task page by @StandardAI in #5611
Replacing the nn.Mish activation function with a get_activation function. by @hi-sushanta in #5651
speed up Shap-E fast test by @yiyixuxu in #5686
Fix the misaligned pipeline usage in dreamshaper docstrings by @kirill-fedyanin in #5700
Fixed is_safetensors_compatible() handling of windows path separators by @PhilLab in #5650
[LCM] Fix img2img by @patrickvonplaten in #5698
[PixArt-Alpha] fix mask feature condition. by @sayakpaul in #5695
Fix styling issues by @patrickvonplaten in #5699
Add adapter fusing + PEFT to the docs by @apolinario in #5662
Fix prompt bug in AnimateDiff by @DN6 in #5702
[Bugfix] fix error of peft lora when xformers enabled by @okotaku in #5697
Install accelerate from PyPI in PR test runner by @DN6 in #5721
consistency decoder by @williamberman in #5694
Correct consist dec by @patrickvonplaten in #5722
LCM Add Tests by @patrickvonplaten in #5707
[LCM] add: locm docs. by @sayakpaul in #5723
Add LCM Scripts by @patil-suraj in #5727

Contributors

apolinario, PhilLab, and 13 other contributors


08 Nov 12:30

sayakpaul

v0.22.3

482a9dd

v0.22.3: Fix PixArtAlpha and LCM Image-to-Image pipelines

🐛 There were some sneaky bugs in the PixArt-Alpha and LCM Image-to-Image pipelines which have been fixed in this release.

All commits

[LCM] Fix img2img by @patrickvonplaten in #5698
[PixArt-Alpha] fix mask feature condition. by @sayakpaul in #5695

Contributors

sayakpaul and patrickvonplaten


07 Nov 17:44

patrickvonplaten

v0.22.2

249c06c

Patch Release v0.22.2: Fix Animate Diff, fix DDPM import, Pixart various

Fix Basic Transformer Block by @DN6 in #5683
[PixArt-Alpha] fix mask_feature so that precomputed embeddings work with a batch size > 1 by @sayakpaul in #5677
Make sure DDPM and diffusers can be used without Transformers by @sayakpaul in #5668
[PixArt-Alpha] Support non-square images by @sayakpaul in #5672

Contributors

DN6 and sayakpaul


06 Nov 14:47

patrickvonplaten

v0.22.1

a1d33fc

Patch Release: Fix community vs. hub pipelines revision

[Custom Pipelines] Make sure that community pipelines can use repo revision by @patrickvonplaten

Contributors

patrickvonplaten


06 Nov 13:03

patrickvonplaten

v0.22.0

df60b35

v0.22.0: LCM, PixArt-Alpha, AnimateDiff, PEFT integration for LoRA, and more

Latent Consistency Models (LCM)

LCMs enable significantly faster inference for diffusion models. They require far fewer inference steps to produce high-resolution images without compromising the image quality too much. Below is a usage example:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32)

# To save GPU memory, torch.float16 can be used, but it may compromise image quality.
pipe.to(torch_device="cuda", torch_dtype=torch.float32)

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"

# Can be set to 1~50 steps. LCM supports fast inference even with <= 4 steps. Recommended: 1~8 steps.
num_inference_steps = 4

images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0).images

Refer to the documentation to learn more.

LCM comes with both text-to-image and image-to-image pipelines and they were contributed by @luosiallen, @nagolinc, and @dg845.

PixArt-Alpha

PixArt-Alpha is a Transformer-based text-to-image diffusion model that rivals the quality of the existing state-of-the-art ones, such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient.

It was trained with T5 text embeddings and has a maximum sequence length of 120. Thus, it allows for more detailed prompt inputs, unlocking better quality generations.

Despite the large text encoder, with model offloading, it takes a little under 11GB of VRAM to run the PixArtAlphaPipeline:

from diffusers import PixArtAlphaPipeline
import torch

pipeline_id = "PixArt-alpha/PixArt-XL-2-1024-MS"
pipeline = PixArtAlphaPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

prompt = "A small cactus with a happy face in the Sahara desert."
image = pipeline(prompt).images[0]
image.save("sahara.png")

Check out the docs to learn more.

AnimateDiff

AnimateDiff is a modelling framework that allows you to create videos using pre-existing Stable Diffusion text-to-image models. It achieves this by inserting motion module layers into a frozen text-to-image model and training it on video clips to extract a motion prior.

These motion modules are applied after the ResNet and Attention blocks in the Stable Diffusion UNet. Their purpose is to introduce coherent motion across image frames. To support these modules, we introduce the concepts of a MotionAdapter and a UNetMotionModel. These serve as a convenient way to use these motion modules with existing Stable Diffusion models.

The following example demonstrates how you can utilize the motion modules with an existing Stable Diffusion text-to-image model.

import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")

# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
scheduler = DDIMScheduler.from_pretrained(
    model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

You can convert an existing 2D UNet into a UNetMotionModel:

from diffusers import MotionAdapter, UNetMotionModel, UNet2DConditionModel

# A UNetMotionModel can be created directly with its default config
unet = UNetMotionModel()

# Load from an existing 2D UNet and MotionAdapter
unet2D = UNet2DConditionModel.from_pretrained("SG161222/Realistic_Vision_V5.1_noVAE", subfolder="unet")
motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")

# load motion adapter here (the motion_adapter argument is optional)
unet_motion = UNetMotionModel.from_unet2d(unet2D, motion_adapter)

# Or load motion modules after init
unet_motion.load_motion_modules(motion_adapter)

# freeze all 2D UNet layers except for the motion modules for finetuning
unet_motion.freeze_unet2d_params()

# Save only motion modules ("path/to/save/model" is a placeholder)
unet_motion.save_motion_modules("path/to/save/model", push_to_hub=True)

AnimateDiff also comes with motion LoRA modules, letting you control subtleties:

import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")

scheduler = DDIMScheduler.from_pretrained(
    model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

Check out the documentation to learn more.

PEFT 🤝 Diffusers

There are many adapters (LoRA, for example) trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images. With the 🤗 PEFT integration in 🤗 Diffusers, it is really easy to load and manage adapters for inference.

Here is an example of combining multiple LoRAs using this new integration:

from diffusers import DiffusionPipeline
import torch

pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")

# Load LoRA 1.
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
# Load LoRA 2.
pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")

# Combine the adapters.
pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0])

# Perform inference.
prompt = "toy_face of a hacker with a hoodie, pixel art"
image = pipe(
    prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0)
).images[0]
image
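
Conceptually, combining adapters with weights can be thought of as adding a weighted sum of low-rank updates to a frozen base weight. The sketch below illustrates that idea on tiny 2x2 matrices in plain Python. The function names and the numbers are made up for this note, and PEFT's actual combination logic may differ in its details.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_adapters(base_weight, adapters, adapter_weights):
    """Return base_weight + sum(weight * up @ down) over the selected adapters."""
    merged = [row[:] for row in base_weight]
    for name, weight in adapter_weights.items():
        up, down = adapters[name]  # low-rank factors of this adapter
        delta = matmul(up, down)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += weight * value
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "pixel": ([[1.0], [0.0]], [[2.0, 0.0]]),  # rank-1 update on the first row
    "toy": ([[0.0], [1.0]], [[0.0, 3.0]]),    # rank-1 update on the second row
}
merged = apply_adapters(base, adapters, {"pixel": 0.5, "toy": 1.0})
print(merged)  # [[2.0, 0.0], [0.0, 4.0]]
```

The weights 0.5 and 1.0 mirror the adapter_weights passed to set_adapters in the example above: the "pixel" update contributes at half strength and the "toy" update at full strength.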

Refer to the documentation to learn more.

Community components with community pipelines

We have had support for community pipelines for a while now. This enables fast integration for pipelines we cannot directly integrate within the core codebase of the library. However, community pipelines always rely on the building blocks from Diffusers, which can be restrictive for advanced use cases.

To alleviate this, we’re elevating community pipelines with community components starting with this release 🤗 By specifying trust_remote_code=True and writing the pipeline repository in a specific way, users can customize their pipeline and component code as flexibly as possible:

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "<change-username>/<change-id>", trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

prompt = "hello"

# Text embeds
prompt_embeds, negative_embeds = pipeline.encode_prompt(prompt)

# Keyframes generation (8x64x40, 2fps)
video_frames = pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_frames=8,
    height=40,
    width=64,
    num_inference_steps=2,
    guidance_scale=9.0,
    output_type="pt"
).frames