Haoge Deng1,4*, Ting Pan2,4*, Haiwen Diao3,4*, Zhengxiong Luo4*, Yufeng Cui4
Huchuan Lu3, Shiguang Shan2, Yonggang Qi1, Xinlong Wang4†
BUPT1, ICT-CAS2, DLUT3, BAAI4
* Equal Contribution, † Corresponding Author
We present NOVA (NOn-Quantized Video Autoregressive Model), which enables efficient autoregressive image and video generation. NOVA reformulates video generation as non-quantized autoregressive modeling: temporal frame-by-frame prediction combined with spatial set-by-set prediction. NOVA generalizes well and supports diverse zero-shot generation abilities within one unified model.
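Conceptually, generation proceeds along two autoregressive axes. The sketch below is purely illustrative (it is not the released `diffnext` implementation, and `model` together with all of its methods are hypothetical placeholders): frames are produced one by one along the temporal axis, and within each frame, continuous (non-quantized) token sets are produced set by set along the spatial axis.

```python
# Illustrative sketch of NOVA's two-level autoregressive factorization.
# NOT the released implementation; `model` and its methods are hypothetical.
import torch


def generate_video_sketch(model, text_embedding, num_frames, sets_per_frame):
    frames = []  # latents of previously generated frames (temporal context)
    for _ in range(num_frames):  # temporal frame-by-frame prediction
        token_sets = []  # continuous token sets produced so far for this frame
        for _ in range(sets_per_frame):  # spatial set-by-set prediction
            # Condition on the text, all past frames, and the sets already
            # generated for the current frame, then predict the next token set
            # directly in continuous space (no vector quantization).
            context = model.encode_context(text_embedding, frames, token_sets)
            token_sets.append(model.predict_token_set(context))
        frames.append(torch.cat(token_sets, dim=1))
    return model.decode_to_pixels(frames)  # decode latents into video frames
```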
- [Dec 2024] Released Project Page.
- [Dec 2024] Released 🤗 Online Demo (T2I, T2V).
- [Dec 2024] Released paper, weights, Quick Start guide, and local Gradio Demo code.
- 🔥 Novel Approach: Non-quantized video autoregressive generation.
- 🔥 State-of-the-art Performance: Leading text-to-image and text-to-video results with high efficiency.
- 🔥 Unified Modeling: Multi-task capabilities in a single unified model.
See the detailed descriptions in the Model Zoo.
Text-to-image models:

| Model | Parameters | Resolution | Training Data | Weight | GenEval | DPGBench |
|---|---|---|---|---|---|---|
| NOVA-0.6B | 0.6B | 512x512 | 16M | 🤗 HF link | 0.75 | 81.76 |
| NOVA-0.3B | 0.3B | 1024x1024 | 600M | 🤗 HF link | 0.67 | 80.60 |
| NOVA-0.6B | 0.6B | 1024x1024 | 600M | 🤗 HF link | 0.69 | 82.25 |
| NOVA-1.4B | 1.4B | 1024x1024 | 600M | 🤗 HF link | 0.71 | 83.01 |
Text-to-video models:

| Model | Parameters | Resolution | Training Data | Weight | VBench |
|---|---|---|---|---|---|
| NOVA-0.6B | 0.6B | 33x768x480 | 20M | 🤗 HF link | 80.12 |
Clone this repository to your local disk and install it:

```bash
pip install diffusers transformers accelerate imageio[ffmpeg]
git clone https://github.com/baaivision/NOVA.git
cd NOVA && pip install .
```
You can also install directly from the remote repository if you have set up your GitHub SSH key:

```bash
pip install diffusers transformers accelerate imageio[ffmpeg]
pip install git+ssh://git@github.com/baaivision/NOVA.git
```
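After either installation route, a minimal import check confirms that the `diffnext` package and a CUDA-enabled PyTorch build are available (the package name matches the imports used in the examples below):

```python
# Minimal post-install check: both imports should succeed.
import torch
import diffnext  # installed from this repository

print("diffnext imported, CUDA available:", torch.cuda.is_available())
```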
Text-to-image generation:

```python
import torch

from diffnext.pipelines import NOVAPipeline

# Load the text-to-image pipeline in half precision and move it to the GPU.
model_id = "BAAI/nova-d48w768-sdxl1024"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to("cuda")

# Generate an image from a text prompt and save it.
prompt = "a shiba inu wearing a beret and black turtleneck."
image = pipe(prompt).images[0]
image.save("shiba_inu.jpg")
```
Text-to-video generation:

```python
import os

import torch

from diffnext.pipelines import NOVAPipeline
from diffnext.utils import export_to_image, export_to_video

# Use the expandable allocator to reduce CUDA memory fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model_id = "BAAI/nova-d48w1024-osp480"
low_memory = False

model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)

if low_memory:
    # Use the CPU model offload routine and the expandable allocator if you run out of memory.
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to("cuda")

# Text to Video
prompt = "Many spotted jellyfish pulsating under water."
video = pipe(prompt, max_latent_length=9).frames[0]
export_to_video(video, "jellyfish.mp4", fps=12)

# Increase AR and diffusion steps for better video quality.
video = pipe(
    prompt,
    max_latent_length=9,
    num_inference_steps=128,  # default: 64
    num_diffusion_steps=100,  # default: 25
).frames[0]
export_to_video(video, "jellyfish_v2.mp4", fps=12)

# You can also generate a single image from text by producing only the first frame.
prompt = "Many spotted jellyfish pulsating under water."
image = pipe(prompt, max_latent_length=1).frames[0, 0]
export_to_image(image, "jellyfish.jpg")
```
Launch the local Gradio demos:

```bash
# For the text-to-image demo
python scripts/app_nova_t2i.py --model "BAAI/nova-d48w1024-sdxl1024" --device 0

# For the text-to-video demo
python scripts/app_nova_t2v.py --model "BAAI/nova-d48w1024-osp480" --device 0
```
- See Training Guide
- See Inference Guide
- See Evaluation Guide
- Model zoo
- Quick Start
- Gradio Demo
- Inference guide
- Finetuning code
- Training code
- Evaluation code
- Prompt Writer
- Larger model size
- Additional downstream tasks: Image editing, Video editing, Controllable generation
If you find this repository useful, please consider giving it a star ⭐ and a citation 🦖:
```bibtex
@article{deng2024nova,
  title={Autoregressive Video Generation without Vector Quantization},
  author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2412.14169},
  year={2024}
}
```
We thank the following repositories for their great work: MAE, MAR, MaskGIT, DiT, Open-Sora-Plan, CogVideo, FLUX, and CodeWithGPU.
Code and models are licensed under Apache License 2.0.