
stable-diffusion : ggml-alloc integration and gpu acceleration #75

Merged: 67 commits, Nov 26, 2023

Conversation

@FSSRepo (Contributor) commented Oct 22, 2023

Context

I had been waiting for a response in #72 from @leejet about updating the GGML version so I could integrate ggml-alloc, the new memory management approach for GGML (also necessary for integrating the CUDA backend). He synchronized it with the latest changes in @ggerganov's GGML repository, but it seems that in the process the GGML library stopped working (at least on my computer it causes issues when computing), because it still has the dynamic mode integrated.

I'm using this ggml version in this PR.

I have removed the ggml submodule and replaced it with a version I already maintain, in which I refactored the Conv1D and Conv2D operations to use ggml_im2col and ggml_mul_mat. This will make it easier for me to create the corresponding CUDA kernels.
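
For illustration, expressing Conv2D as im2col followed by a matrix multiplication looks roughly like this (a sketch modeled on the later upstream ggml_conv_2d; the exact ggml_im2col signature on this PR's branch may differ):

    // a: convolution kernel [KW, KH, IC, OC], b: input [W, H, IC, N]
    struct ggml_tensor * conv2d_sketch(struct ggml_context * ctx,
                                       struct ggml_tensor * a,
                                       struct ggml_tensor * b,
                                       int s0, int s1, int p0, int p1,
                                       int d0, int d1) {
        // unfold the input patches into rows: [N, OH, OW, IC*KH*KW]
        struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1,
                                                  d0, d1, true, GGML_TYPE_F16);
        // the convolution collapses into one big mul_mat, which the GPU
        // backends already handle efficiently
        struct ggml_tensor * res = ggml_mul_mat(ctx,
            ggml_reshape_2d(ctx, im2col, im2col->ne[0],
                            im2col->ne[3]*im2col->ne[2]*im2col->ne[1]),
            ggml_reshape_2d(ctx, a, a->ne[0]*a->ne[1]*a->ne[2], a->ne[3]));
        res = ggml_reshape_4d(ctx, res, im2col->ne[1], im2col->ne[2],
                              im2col->ne[3], a->ne[3]);           // [OW, OH, N, OC]
        return ggml_cont(ctx, ggml_permute(ctx, res, 0, 1, 3, 2)); // [OW, OH, OC, N]
    }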

Build this PR

git clone --recursive https://github.com/FSSRepo/stable-diffusion.cpp.git  # --recursive fetches the ggml submodule
cd stable-diffusion.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release

I've noticed that ggml-alloc consumes much more memory than leejet's current dynamic mode. I'm likely overlooking some configuration to prevent extensive compute graphs from consuming too much memory with ggml-alloc.

Here are the comparative details and results of the ggml-alloc integration:

./sd  -m kotosmix_q4_0.bin --verbose -p "beautiful anime girl, white hair, blue eyes, realistic, 4k, high quality" -n "bad quality, ugly" --steps 10 -t 6 -s 48747

Note: stable-diffusion.cpp commit afec505 is the one that worked for me.

Memory usage:

| compute graph | dynamic mode (runtime memory) | ggml-alloc (compute buffer size) |
| --- | --- | --- |
| learned condition (CLIP Text Encoder) | 13.37 MB | 25.54 MB |
| diffusion (UNet Model) | 570.65 MB | 1147.33 MB |
| AutoEncoderKL (VAE Decode) | 1674.14 MB | 2496.00 MB |

Timings:

| compute graph | dynamic mode | ggml-alloc |
| --- | --- | --- |
| learned condition (CLIP Text Encoder) | 0.26 s | 0.23 s |
| diffusion (UNet Model) | ~18.22 s per sample | ~13.81 s per sample |
| AutoEncoderKL (VAE Decode) | 27 s | 24.22 s |

Result image (identical in both cases):

[image: output]

Over the next few days I will be making some improvements, investigating and gathering feedback on ggml-alloc to reduce the current memory usage; once I have fixed that, I can start working on implementing the CUDA kernels.

Progress overview

Here's a summary of what I will be carrying out in the coming days to complete this PR:

ggml operations that stable diffusion uses and their availability across backends:

| operation | CPU backend | CUDA backend | kernel status |
| --- | --- | --- | --- |
| ADD | ✅ | ✅ | support broadcasting to avoid ggml_repeat |
| MUL | ✅ | ✅ | support broadcasting to avoid ggml_repeat |
| CONCAT | ✅ | ❌ (needs implementation) | |
| NORM | ✅ | ✅ | |
| GET_ROWS | ✅ | ✅ | |
| DIAG_MASK_INF | ✅ | ✅ | |
| GROUP_NORM | ✅ | ❌ (needs implementation) | the current kernel implementation is too inefficient |
| MUL_MAT | ✅ | ✅ | |
| SCALE | ✅ | ✅ | |
| PERMUTE | - | - | |
| RESHAPE | - | - | |
| VIEW | - | - | |
| SOFT_MAX | ✅ | ✅ | |
| IM2COL | ✅ | ✅ | |
| UPSCALE | ✅ | ❌ (needs implementation) | |
| UNARY | ✅ | ❌ (only GeLU and SiLU; GeLU Quick missing) | |
| PAD | ✅ | ✅ | the VAE decoder requires zero-padding a tensor |
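
To illustrate the broadcasting entries above: once the ADD/MUL kernels support broadcasting, the explicit ggml_repeat of the smaller operand is no longer needed (a sketch; x and bias are hypothetical tensors with bias broadcastable over x):

    // before: materialize the broadcast explicitly
    struct ggml_tensor * y0 = ggml_add(ctx, x, ggml_repeat(ctx, bias, x));
    // after: the kernel broadcasts bias over x directly,
    // saving a full-size intermediate tensor
    struct ggml_tensor * y1 = ggml_add(ctx, x, bias);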

Tasks to complete this PR:

  • Integrate ggml-alloc to simplify the use of ggml-backend.
  • Improve memory usage: determine whether I'm using the ggml-alloc API correctly; I might be missing something that could help reduce unnecessary memory usage during compute. Can you help me, @slaren @ggerganov?
  • Implement the missing CUDA kernels ggml_concat, ggml_group_norm, ggml_upscale, ggml_gelu_quick.
  • Offload the model to the GPU and conduct performance and memory usage tests.
  • Implement support for custom VAEs. I will try to adopt the gguf format to simplify the loading of tensors.

@ggerganov (Contributor)

> I've noticed that ggml-alloc consumes much more memory than leejet's current dynamic mode. I'm likely overlooking some configuration to prevent extensive compute graphs from consuming too much memory with ggml-alloc.

Nothing comes directly to mind; I need to have a more detailed look at the code when I get some time.
One possible mistake is not reusing the same compute buffer for new graphs - I ran into trouble with that when adding ggml-alloc to whisper.cpp: ggerganov/whisper.cpp#1270 (comment).
It does not seem very related to the increased memory you observed, though.
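
For reference, the pattern used in whisper.cpp is roughly the following (a sketch; build_graph stands in for this project's graph builders, and the alignment constant is an assumption):

    static const size_t tensor_alignment = 32; // assumed alignment

    // measure pass: find the worst-case compute buffer size once
    struct ggml_allocr * alloc = ggml_allocr_new_measure(tensor_alignment);
    size_t compute_size = ggml_allocr_alloc_graph(alloc, build_graph(/*...*/))
                        + tensor_alignment;
    ggml_allocr_free(alloc);

    // allocate one compute buffer and reuse it for every evaluation
    std::vector<uint8_t> buf(compute_size);
    alloc = ggml_allocr_new(buf.data(), buf.size(), tensor_alignment);
    for (int step = 0; step < n_steps; ++step) {
        ggml_allocr_reset(alloc);              // reuse the buffer, don't grow it
        struct ggml_cgraph * gf = build_graph(/*...*/);
        ggml_allocr_alloc_graph(alloc, gf);    // assign offsets inside buf
        // ... evaluate gf ...
    }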

@leejet (Owner) commented Oct 23, 2023

Great job! I'll find some time to review and merge it.

@Jonathhhan commented Oct 23, 2023

@FSSRepo thanks.
I need to comment out lines 245-247 in stable-diffusion.cpp; then it works for me:

    //if(allocr != NULL) {
        //ggml_allocr_alloc(allocr, embedding);
    //}

Otherwise I get: GGML_ASSERT: ggml-alloc.c:116: tensor->data == 0
One step takes about 11s with 512x512, which is faster than before.

@FSSRepo (Contributor, Author) commented Oct 23, 2023

> Otherwise I get: GGML_ASSERT: ggml-alloc.c:116: tensor->data == 0

I will try to reproduce this, because it's strange behavior. Can you provide more information: operating system, hardware details?

@Jonathhhan commented Oct 23, 2023

> Can you provide more information: operating system, hardware details?

Of course: Windows 11 / VS / CUDA / RTX 3090 / i7-12700k / 32GB RAM / model: v2-1_512-ema-pruned-ggml-model-f16.bin

With the main branch I get this error:
GGML_ASSERT: ggml.c: ctx->mem_buffer != NULL
zylon-ai/private-gpt#90
So maybe there is not enough space on my HDD...

@FSSRepo (Contributor, Author) commented Oct 23, 2023

> I need to comment out line 245-257 in stable-diffusion.cpp
>
>     //if(allocr != NULL) {
>         //ggml_allocr_alloc(allocr, embedding);
>     //}

245 to 257 spans 13 lines; do you mean 245 to 247?

@Jonathhhan commented Oct 23, 2023

> do you mean 245 to 247?

Yes, my mistake (and the error still occurs, with around 20 GB left on my system drive)...

@Jonathhhan

With /D GGML_USE_CUBLAS, one step at 512x512 takes between 8 and 9 seconds.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> With /D GGML_USE_CUBLAS, one step at 512x512 takes between 8 and 9 seconds.

The backend used for computation is currently fixed to cpu_backend:

        backend = ggml_backend_cpu_init();
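
For the curious, switching would look something like this (a hypothetical sketch; note that ggml_backend_cuda_init's signature has changed across ggml revisions, with later versions taking a device index):

    ggml_backend_t backend = NULL;
    #ifdef GGML_USE_CUBLAS
        backend = ggml_backend_cuda_init();    // run the graph on the GPU
    #endif
    if (backend == NULL) {
        backend = ggml_backend_cpu_init();     // fallback: CPU backend
    }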

@Jonathhhan

> The backend used for computation is currently fixed to cpu_backend

I see. Interestingly, it is already a little faster (~8 vs ~10 seconds/step).

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> > The backend used for computation is currently fixed to cpu_backend
>
> I see. Interestingly, it is already a little faster (~8 vs ~10 seconds/step).

Possibly matrix multiplications are being delegated to the GPU.

The ultimate goal of this PR is to perform 100% of the computation on the GPU and see what performance I get on my 4 GB RTX 3050 laptop. Tomorrow I will finish the missing kernels and enable the CUDA backend.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> I need to comment out lines 245-247 in stable-diffusion.cpp; then it works for me:

Were you using Stable Diffusion 2 when this error happened?

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> With the main branch I get this error: GGML_ASSERT: ggml.c: ctx->mem_buffer != NULL imartinez/privateGPT#90 So maybe there is not enough space on my HDD...

Why does that matter in this context?

@Jonathhhan

@FSSRepo yes, it was with Stable Diffusion 2. I had the error GGML_ASSERT: ggml.c: ctx->mem_buffer != NULL with the main branch, so I thought maybe this one was caused by missing disk space too (but no, it is not, so it does not matter in this context).

@slaren commented Oct 24, 2023

Instead of calling ggml_allocr_alloc after every tensor, you can do something like this to allocate all the tensors created in a context:

for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
    ggml_allocr_alloc(alloc, t);
}

You can also use a similar loop calling ggml_nbytes to calculate the size of the buffer.
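
A sketch of that size computation (the per-tensor padding is an assumption to cover alignment overhead):

    const size_t tensor_alignment = 32; // assumed alignment
    size_t buf_size = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        buf_size += ggml_nbytes(t) + tensor_alignment; // data + alignment slack
    }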

Caveat: the functions to enumerate the tensors in a context are currently only available in the llama.cpp repository; they still need to be synced to the ggml repo.

Related: ggerganov/ggml#578

@FSSRepo (Contributor, Author) commented Oct 24, 2023

Why did you close this?

@leejet (Owner) commented Oct 24, 2023

To resolve merge conflicts, I rebased the master branch of FSSRepo/stable-diffusion.cpp onto the latest master branch of leejet/stable-diffusion.cpp. I also updated the URL for GGML to https://github.com/FSSRepo/ggml/tree/master to make it easier to review the changes and facilitate future merges.

@leejet (Owner) commented Oct 24, 2023

It seems like I mistakenly closed this pull request.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> To resolve merge conflicts, I rebased the master branch of FSSRepo/stable-diffusion.cpp onto the latest master branch of leejet/stable-diffusion.cpp. I also updated the URL for GGML to https://github.com/FSSRepo/ggml/tree/master to make it easier to review the changes and facilitate future merges.

I will consider whether it's a good idea to do this in the future. It would be best for me to create a dedicated branch for stable diffusion, where I will make the necessary modifications during the progress of this pull request, since I am using the master branch of FSSRepo/ggml for another pull request in the original ggml repository.

@leejet (Owner) commented Oct 24, 2023

> It seems like I mistakenly closed this pull request.

I am unable to push my changes to https://github.com/FSSRepo/stable-diffusion.cpp/tree/master at the moment, which prevents me from reopening this pull request. Can you please pull https://github.com/leejet/stable-diffusion.cpp/tree/pr75 and push the corresponding changes to https://github.com/FSSRepo/stable-diffusion.cpp/tree/master? This way, I can reopen this pull request.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

Try now.

I created a new branch, sd-cpp-fixes, in my ggml repository for this PR.

@leejet (Owner) commented Oct 24, 2023

Sorry for the inconvenience

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> Sorry for the inconvenience

Don't worry. Anyway, I will need your help to understand how some aspects of this stable diffusion implementation work.

@leejet (Owner) commented Oct 24, 2023

I still can't reopen this PR. It says 'there are no new commits.' It might require you to push the new changes to master.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> I still can't reopen this PR. It says 'there are no new commits.' It might require you to push the new changes to master.

I will try it

@FSSRepo (Contributor, Author) commented Oct 24, 2023

I created the branch sd-cpp-fixes, where I will be making the changes during the development of this PR. Can you update the submodule, @leejet?

This will need to be done in your main repository, and the changes then applied to my fork. Afterward, I will add my changes to stable-diffusion.cpp and reopen this PR.

@leejet (Owner) commented Oct 24, 2023

I've updated the submodule 'ggml' in this branch https://github.com/leejet/stable-diffusion.cpp/tree/pr75 to the latest commit of FSSRepo/sd-cpp-fixes, and this branch also includes the changes you made to stable-diffusion.cpp earlier.

@leejet (Owner) commented Oct 24, 2023

You can pull this branch to your local machine, rebase your local stable-diffusion.cpp master branch to this branch, and then push the changes to https://github.com/FSSRepo/stable-diffusion.cpp/tree/master.

@Green-Sky (Contributor)

@FSSRepo the timer for sampling is not reset between images, so the reported sampling time grows with every image. Not sure how to properly display this.

[INFO]  stable-diffusion.cpp:4685 - generating image: 1/10
[INFO]  stable-diffusion.cpp:4693 - start sampling
[DEBUG] stable-diffusion.cpp:2366 - diffusion compute buffer size: 115.28 MB
[INFO]  stable-diffusion.cpp:4511 - sampling using LCM method
  |==================================================| 26/26 - 4.16s/it
[INFO]  stable-diffusion.cpp:4698 - sampling completed, taking 107.90s
[INFO]  stable-diffusion.cpp:4685 - generating image: 2/10
[INFO]  stable-diffusion.cpp:4693 - start sampling
[DEBUG] stable-diffusion.cpp:2366 - diffusion compute buffer size: 115.28 MB
[INFO]  stable-diffusion.cpp:4511 - sampling using LCM method
  |==================================================| 26/26 - 4.33s/it
[INFO]  stable-diffusion.cpp:4698 - sampling completed, taking 218.30s
[INFO]  stable-diffusion.cpp:4685 - generating image: 3/10
[INFO]  stable-diffusion.cpp:4693 - start sampling
[DEBUG] stable-diffusion.cpp:2366 - diffusion compute buffer size: 115.28 MB
[INFO]  stable-diffusion.cpp:4511 - sampling using LCM method
  |==================================================| 26/26 - 4.19s/it
[INFO]  stable-diffusion.cpp:4698 - sampling completed, taking 329.84s
[INFO]  stable-diffusion.cpp:4685 - generating image: 4/10
[INFO]  stable-diffusion.cpp:4693 - start sampling
[DEBUG] stable-diffusion.cpp:2366 - diffusion compute buffer size: 115.28 MB
[INFO]  stable-diffusion.cpp:4511 - sampling using LCM method
  |==================================================| 26/26 - 4.11s/it
[INFO]  stable-diffusion.cpp:4698 - sampling completed, taking 438.95s
[INFO]  stable-diffusion.cpp:4685 - generating image: 5/10
[INFO]  stable-diffusion.cpp:4693 - start sampling

@Jonathhhan commented Nov 23, 2023

One batch generation works well for me, but if I start a second batch (also with batch size 1) the app crashes and I get this error message:

[DEBUG] stable-diffusion.cpp:1101 - parse 'a lovely cat, van gogh' to [['a lovely cat, van gogh', 1], ]
[DEBUG] stable-diffusion.cpp:503  - split prompt "a lovely cat, van gogh" to tokens ["a</w>", "lovely</w>", "cat</w>", ",</w>", "van</w>", "gogh</w>", ]
[DEBUG] stable-diffusion.cpp:1034 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4049 - computing condition graph completed, taking 41 ms
[DEBUG] stable-diffusion.cpp:1101 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:503  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1034 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4049 - computing condition graph completed, taking 38 ms
[INFO]  stable-diffusion.cpp:4673 - get_learned_condition completed, taking 81 ms
[INFO]  stable-diffusion.cpp:4685 - generating image: 1/4
[INFO]  stable-diffusion.cpp:4693 - start sampling
GGML_ASSERT: /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml.c:4065: ggml_can_mul_mat(a, b)

CUDA error 4 at /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:8550: driver shutting down
current device: 0
Segmentation fault
make: *** [../../../libs/openFrameworksCompiled/project/makefileCommon/compile.project.mk:191: RunRelease] Error 139

Anyway, it only happens with my openFrameworks implementation (but on both Linux and Windows)...

@Jonathhhan commented Nov 23, 2023

libstable-diffusion.so points to the path where libggml.so was created.
It's possible to change that to a relative path with set_target_properties in this file https://github.com/FSSRepo/stable-diffusion.cpp/blob/master/CMakeLists.txt (not sure if there are any drawbacks or if there is a better way to do it):

add_library(${SD_LIB} stable-diffusion.h stable-diffusion.cpp)
set_target_properties(${SD_LIB} PROPERTIES
        BUILD_WITH_INSTALL_RPATH FALSE
        LINK_FLAGS "-Wl,-rpath,$ORIGIN/")
target_link_libraries(${SD_LIB} PUBLIC ggml)
target_include_directories(${SD_LIB} PUBLIC .)
target_compile_features(${SD_LIB} PUBLIC cxx_std_11)
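
As a sanity check (assuming a Linux build), the runtime search path embedded in the resulting library can be inspected with readelf:

    readelf -d libstable-diffusion.so | grep -i rpath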

@FSSRepo (Contributor, Author) commented Nov 23, 2023

@Jonathhhan I don't notice any change when adding this part; it continues to generate stable-diffusion.lib in the same directory.

set_target_properties(${SD_LIB} PROPERTIES
        BUILD_WITH_INSTALL_RPATH FALSE
        LINK_FLAGS "-Wl,-rpath,$ORIGIN/")

@Jonathhhan commented Nov 23, 2023

@FSSRepo yes, it is still generated in the same directory. The difference is that with the change, stable-diffusion.so links to the ggml.so in its own folder (wherever that is), while before it was linked to the ggml.so in the folder where it was created.

@FSSRepo (Contributor, Author) commented Nov 23, 2023

> @FSSRepo yes, it is still generated in the same directory. The difference is that with the change, stable-diffusion.so links to the ggml.so in its own folder (wherever that is), while before it was linked to the ggml.so in the folder where it was created.

Does this fix your problem?

GGML_ASSERT: /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml.c:4065: ggml_can_mul_mat(a, b)

CUDA error 4 at /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:8550: driver shutting down
current device: 0
Segmentation fault
make: *** [../../../libs/openFrameworksCompiled/project/makefileCommon/compile.project.mk:191: RunRelease] Error 139

@Jonathhhan commented Nov 23, 2023

@FSSRepo no, it's a different one. Maybe it happens (really no idea) because the buffers/contexts are freed after each batch, when they should only be freed when the program is closed, or recreated when a new model is loaded?

Edit: it works if I do not free the context at the end of StableDiffusion::txt2img / StableDiffusion::img2img:
// ggml_free(work_ctx);

@Jonathhhan

With StableDiffusion::img2img I get:

[INFO]  stable-diffusion.cpp:4813 - sampling using modified DPM++ (2M) method
GGML_ASSERT: /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml.c:5253: a->ne[2] == b->ne[2]

CUDA error 4 at /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:8550: driver shutting down
current device: 0
Segmentation fault

@FSSRepo (Contributor, Author) commented Nov 23, 2023

@Jonathhhan check now

@Jonathhhan commented Nov 23, 2023

@FSSRepo it works. Thanks.

@paulocoutinhox

Hi, one question: does it work on iOS?

@FSSRepo (Contributor, Author) commented Nov 25, 2023

@paulocoutinhox Yes, it can be done, but it requires creating the missing kernels for the Metal backend, based on the ones I have created for cuBLAS. I don't have a Mac or an iPhone, and I don't know much about Metal programming, so I am the least qualified to contribute there.

@paulocoutinhox

Nice. But if I run it today on an iPhone, what happens? Will it run on the CPU?

@Green-Sky (Contributor)

> Nice. But if I run it today on an iPhone, what happens? Will it run on the CPU?

Yes.

@leejet (Owner) commented Nov 26, 2023

I've tested it on Windows/Linux/macOS, and now it's working perfectly. The PR is ready for merging. Thank you for your excellent work!

@leejet merged commit 8124588 into leejet:master on Nov 26, 2023. 7 checks passed.