
stable-diffusion : ggml-alloc integration and gpu acceleration #75

Merged: 67 commits, Nov 26, 2023

Conversation

@FSSRepo (Contributor) commented Oct 22, 2023

Context

I had been waiting for a response in #72 from @leejet about updating the GGML version so I could integrate ggml-alloc, the new memory management approach for GGML (also necessary for integrating the CUDA backend). He synchronized it with the latest changes in @ggerganov's GGML repository, but it seems that in the process the GGML library stopped working (at least on my computer it causes issues when computing), because it still has the dynamic mode integrated.

I'm using this ggml version in this PR.

I have removed the ggml submodule and replaced it with a version I already maintain, in which I refactored the Conv1D and Conv2D operations to use ggml_im2col and ggml_mul_mat. This will make it easier for me to create the corresponding CUDA kernels.
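
For illustration, expressing Conv2D as im2col followed by a matrix multiplication looks roughly like this (a sketch modeled on the later upstream ggml_conv_2d; the exact ggml_im2col signature on this PR's branch may differ):

    // a: convolution kernel [KW, KH, IC, OC], b: input [W, H, IC, N]
    struct ggml_tensor * conv2d_sketch(struct ggml_context * ctx,
                                       struct ggml_tensor * a,
                                       struct ggml_tensor * b,
                                       int s0, int s1, int p0, int p1,
                                       int d0, int d1) {
        // unfold the input patches into rows: [N, OH, OW, IC*KH*KW]
        struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1,
                                                  d0, d1, true, GGML_TYPE_F16);
        // the convolution collapses into one big mul_mat, which the GPU
        // backends already handle efficiently
        struct ggml_tensor * res = ggml_mul_mat(ctx,
            ggml_reshape_2d(ctx, im2col, im2col->ne[0],
                            im2col->ne[3]*im2col->ne[2]*im2col->ne[1]),
            ggml_reshape_2d(ctx, a, a->ne[0]*a->ne[1]*a->ne[2], a->ne[3]));
        res = ggml_reshape_4d(ctx, res, im2col->ne[1], im2col->ne[2],
                              im2col->ne[3], a->ne[3]);           // [OW, OH, N, OC]
        return ggml_cont(ctx, ggml_permute(ctx, res, 0, 1, 3, 2)); // [OW, OH, OC, N]
    }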

Build this PR

git clone --recursive https://github.com/FSSRepo/stable-diffusion.cpp.git  # --recursive fetches the ggml submodule
cd stable-diffusion.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release

I've noticed that ggml-alloc consumes much more memory than leejet's current dynamic mode. I'm likely overlooking some configuration to prevent extensive compute graphs from consuming too much memory with ggml-alloc.

Here are the comparative details and results of the ggml-alloc integration:

./sd  -m kotosmix_q4_0.bin --verbose -p "beautiful anime girl, white hair, blue eyes, realistic, 4k, high quality" -n "bad quality, ugly" --steps 10 -t 6 -s 48747

Note: stable-diffusion.cpp commit afec505 is the one that worked for me.

Memory usage:

| compute graph | dynamic mode (runtime memory) | ggml-alloc (compute buffer size) |
| --- | --- | --- |
| learned condition (CLIP Text Encoder) | 13.37 MB | 25.54 MB |
| diffusion (UNet Model) | 570.65 MB | 1147.33 MB |
| AutoEncoderKL (VAE Decode) | 1674.14 MB | 2496.00 MB |

Timings:

| compute graph | dynamic mode | ggml-alloc |
| --- | --- | --- |
| learned condition (CLIP Text Encoder) | 0.26 s | 0.23 s |
| diffusion (UNet Model) | ~18.22 s per sample | ~13.81 s per sample |
| AutoEncoderKL (VAE Decode) | 27 s | 24.22 s |

Result image (identical in both cases):

[image: output]

Over the next few days I will be making some improvements, investigating and gathering feedback on ggml-alloc to reduce the current memory usage; once I have fixed that, I can start working on implementing the CUDA kernels.

Progress overview

Here's a summary of what I will be carrying out in the coming days to complete this PR:

ggml operations that stable diffusion uses and their availability across backends:

| operation | CPU backend | CUDA backend | kernel status |
| --- | --- | --- | --- |
| ADD | ✅ | ✅ | support broadcasting to avoid ggml_repeat |
| MUL | ✅ | ✅ | support broadcasting to avoid ggml_repeat |
| CONCAT | ✅ | ❌ (needs implementation) | |
| NORM | ✅ | ✅ | |
| GET_ROWS | ✅ | ✅ | |
| DIAG_MASK_INF | ✅ | ✅ | |
| GROUP_NORM | ✅ | ❌ (needs implementation) | the current kernel implementation is too inefficient |
| MUL_MAT | ✅ | ✅ | |
| SCALE | ✅ | ✅ | |
| PERMUTE | - | - | |
| RESHAPE | - | - | |
| VIEW | - | - | |
| SOFT_MAX | ✅ | ✅ | |
| IM2COL | ✅ | ✅ | |
| UPSCALE | ✅ | ❌ (needs implementation) | |
| UNARY | ✅ | ❌ (only GeLU and SiLU; GeLU Quick missing) | |
| PAD | ✅ | ✅ | the VAE decoder requires zero-padding a tensor |
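
To illustrate the broadcasting entries above: once the ADD/MUL kernels support broadcasting, the explicit ggml_repeat of the smaller operand is no longer needed (a sketch; x and bias are hypothetical tensors with bias broadcastable over x):

    // before: materialize the broadcast explicitly
    struct ggml_tensor * y0 = ggml_add(ctx, x, ggml_repeat(ctx, bias, x));
    // after: the kernel broadcasts bias over x directly,
    // saving a full-size intermediate tensor
    struct ggml_tensor * y1 = ggml_add(ctx, x, bias);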

Tasks to complete this PR:

  • Integrate ggml-alloc to simplify the use of ggml-backend.
  • Improve memory usage: determine whether I'm using the ggml-alloc API correctly; I might be missing something that could help reduce unnecessary memory usage during compute. Can you help me, @slaren @ggerganov?
  • Implement the missing CUDA kernels ggml_concat, ggml_group_norm, ggml_upscale, ggml_gelu_quick.
  • Offload the model to the GPU and conduct performance and memory usage tests.
  • Implement support for custom VAEs. I will try to adopt the gguf format to simplify the loading of tensors.

@ggerganov (Contributor)

> I've noticed that ggml-alloc consumes much more memory than leejet's current dynamic mode. I'm likely overlooking some configuration to prevent extensive compute graphs from consuming too much memory with ggml-alloc.

Nothing comes directly to mind; I need to have a more detailed look at the code when I get some time.
One possible mistake is not reusing the same compute buffer for new graphs - I ran into trouble with that when adding ggml-alloc to whisper.cpp: ggerganov/whisper.cpp#1270 (comment).
It does not seem very related to the increased memory you observed, though.
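
For reference, the pattern used in whisper.cpp is roughly the following (a sketch; build_graph stands in for this project's graph builders, and the alignment constant is an assumption):

    static const size_t tensor_alignment = 32; // assumed alignment

    // measure pass: find the worst-case compute buffer size once
    struct ggml_allocr * alloc = ggml_allocr_new_measure(tensor_alignment);
    size_t compute_size = ggml_allocr_alloc_graph(alloc, build_graph(/*...*/))
                        + tensor_alignment;
    ggml_allocr_free(alloc);

    // allocate one compute buffer and reuse it for every evaluation
    std::vector<uint8_t> buf(compute_size);
    alloc = ggml_allocr_new(buf.data(), buf.size(), tensor_alignment);
    for (int step = 0; step < n_steps; ++step) {
        ggml_allocr_reset(alloc);              // reuse the buffer, don't grow it
        struct ggml_cgraph * gf = build_graph(/*...*/);
        ggml_allocr_alloc_graph(alloc, gf);    // assign offsets inside buf
        // ... evaluate gf ...
    }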

@leejet (Owner) commented Oct 23, 2023

Great job! I'll find some time to review and merge it.

@Jonathhhan commented Oct 23, 2023

@FSSRepo thanks.
I need to comment out lines 245-247 in stable-diffusion.cpp; then it works for me:

    //if(allocr != NULL) {
        //ggml_allocr_alloc(allocr, embedding);
    //}

Otherwise I get: GGML_ASSERT: ggml-alloc.c:116: tensor->data == 0
One step takes about 11s with 512x512, which is faster than before.

@FSSRepo (Contributor, Author) commented Oct 23, 2023

> Otherwise I get: GGML_ASSERT: ggml-alloc.c:116: tensor->data == 0

I will try to reproduce this, because it's strange behavior. Can you provide more information: operating system, hardware details?

@Jonathhhan commented Oct 23, 2023

> Can you provide more information: operating system, hardware details?

Of course: Windows 11 / VS / CUDA / RTX 3090 / i7-12700k / 32GB RAM / model: v2-1_512-ema-pruned-ggml-model-f16.bin

With the main branch I get this error:
GGML_ASSERT: ggml.c: ctx->mem_buffer != NULL
zylon-ai/private-gpt#90
So maybe there is not enough space on my HDD...

@FSSRepo (Contributor, Author) commented Oct 23, 2023

> I need to comment out line 245-257 in stable-diffusion.cpp
>
>     //if(allocr != NULL) {
>         //ggml_allocr_alloc(allocr, embedding);
>     //}

245 to 257 spans 13 lines; do you mean 245 to 247?

@Jonathhhan commented Oct 23, 2023

> do you mean 245 to 247?

Yes, my mistake (and the error still occurs, with around 20 GB left on my system drive)...

@Jonathhhan

With /D GGML_USE_CUBLAS, one step at 512x512 takes between 8 and 9 seconds.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> With /D GGML_USE_CUBLAS, one step at 512x512 takes between 8 and 9 seconds.

The backend used for computation is currently fixed to cpu_backend:

        backend = ggml_backend_cpu_init();
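
For the curious, switching would look something like this (a hypothetical sketch; note that ggml_backend_cuda_init's signature has changed across ggml revisions, with later versions taking a device index):

    ggml_backend_t backend = NULL;
    #ifdef GGML_USE_CUBLAS
        backend = ggml_backend_cuda_init();    // run the graph on the GPU
    #endif
    if (backend == NULL) {
        backend = ggml_backend_cpu_init();     // fallback: CPU backend
    }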

@Jonathhhan

> The backend used for computation is currently fixed to cpu_backend

I see. Interestingly, it is already a little faster (~8 vs ~10 seconds/step).

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> > The backend used for computation is currently fixed to cpu_backend
>
> I see. Interestingly, it is already a little faster (~8 vs ~10 seconds/step).

Possibly matrix multiplications are being delegated to the GPU.

The ultimate goal of this PR is to perform 100% of the computation on the GPU and see what performance I get on my 4 GB RTX 3050 laptop. Tomorrow I will finish the missing kernels and enable the CUDA backend.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> I need to comment out lines 245-247 in stable-diffusion.cpp; then it works for me:

Were you using Stable Diffusion 2 when this error happened?

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> With the main branch I get this error: GGML_ASSERT: ggml.c: ctx->mem_buffer != NULL imartinez/privateGPT#90 So maybe there is not enough space on my HDD...

Why does that matter in this context?

@Jonathhhan

@FSSRepo yes, it was with Stable Diffusion 2. I had the error GGML_ASSERT: ggml.c: ctx->mem_buffer != NULL with the main branch, so I thought maybe this one was caused by missing disk space too (but no, it is not, so it does not matter in this context).

@slaren commented Oct 24, 2023

Instead of calling ggml_allocr_alloc after every tensor, you can do something like this to allocate all the tensors created in a context:

for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
    ggml_allocr_alloc(alloc, t);
}

You can also use a similar loop calling ggml_nbytes to calculate the size of the buffer.
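
A sketch of that size computation (the per-tensor padding is an assumption to cover alignment overhead):

    const size_t tensor_alignment = 32; // assumed alignment
    size_t buf_size = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        buf_size += ggml_nbytes(t) + tensor_alignment; // data + alignment slack
    }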

Caveat: the functions to enumerate the tensors in a context are currently only available in the llama.cpp repository; they still need to be synced to the ggml repo.

Related: ggerganov/ggml#578

@FSSRepo (Contributor, Author) commented Oct 24, 2023

Why did you close this?

@leejet (Owner) commented Oct 24, 2023

To resolve merge conflicts, I rebased the master branch of FSSRepo/stable-diffusion.cpp onto the latest master branch of leejet/stable-diffusion.cpp. I also updated the URL for GGML to https://github.com/FSSRepo/ggml/tree/master to make it easier to review the changes and facilitate future merges.

@leejet (Owner) commented Oct 24, 2023

It seems like I mistakenly closed this pull request.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> To resolve merge conflicts, I rebased the master branch of FSSRepo/stable-diffusion.cpp onto the latest master branch of leejet/stable-diffusion.cpp. I also updated the URL for GGML to https://github.com/FSSRepo/ggml/tree/master to make it easier to review the changes and facilitate future merges.

I will consider whether it's a good idea to do this in the future. It would be best for me to create a dedicated branch for stable diffusion, where I will make the necessary modifications during the progress of this pull request, since I am using the master branch of FSSRepo/ggml for another pull request in the original ggml repository.

@leejet (Owner) commented Oct 24, 2023

> It seems like I mistakenly closed this pull request.

I am unable to push my changes to https://github.com/FSSRepo/stable-diffusion.cpp/tree/master at the moment, which prevents me from reopening this pull request. Can you please pull https://github.com/leejet/stable-diffusion.cpp/tree/pr75 and push the corresponding changes to https://github.com/FSSRepo/stable-diffusion.cpp/tree/master? This way, I can reopen this pull request.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

Try now.

I created a new branch, sd-cpp-fixes, in my ggml repository for this PR.

@leejet (Owner) commented Oct 24, 2023

Sorry for the inconvenience

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> Sorry for the inconvenience

Don't worry. Anyway, I will need your help to understand how some aspects of this stable diffusion implementation work.

@leejet (Owner) commented Oct 24, 2023

I still can't reopen this PR. It says 'there are no new commits.' It might require you to push the new changes to master.

@FSSRepo (Contributor, Author) commented Oct 24, 2023

> I still can't reopen this PR. It says 'there are no new commits.' It might require you to push the new changes to master.

I will try it

@FSSRepo (Contributor, Author) commented Oct 24, 2023

I created the branch sd-cpp-fixes, where I will be making the changes during the development of this PR. Can you update the submodule, @leejet?

This will need to be done in your main repository, and the changes then applied to my fork. Afterward, I will add my changes to stable-diffusion.cpp and reopen this PR.

@leejet (Owner) commented Oct 24, 2023

I've updated the submodule 'ggml' in this branch https://github.com/leejet/stable-diffusion.cpp/tree/pr75 to the latest commit of FSSRepo/sd-cpp-fixes, and this branch also includes the changes you made to stable-diffusion.cpp earlier.

@leejet (Owner) commented Oct 24, 2023

You can pull this branch to your local machine, rebase your local stable-diffusion.cpp master branch to this branch, and then push the changes to https://github.com/FSSRepo/stable-diffusion.cpp/tree/master.

@Green-Sky (Contributor)

@FSSRepo the timer for sampling is not reset between images, so the reported sampling time grows with every image. Not sure how to properly display this.

[INFO]  stable-diffusion.cpp:4685 - generating image: 1/10
[INFO]  stable-diffusion.cpp:4693 - start sampling
[DEBUG] stable-diffusion.cpp:2366 - diffusion compute buffer size: 115.28 MB
[INFO]  stable-diffusion.cpp:4511 - sampling using LCM method
  |==================================================| 26/26 - 4.16s/it
[INFO]  stable-diffusion.cpp:4698 - sampling completed, taking 107.90s
[INFO]  stable-diffusion.cpp:4685 - generating image: 2/10
[INFO]  stable-diffusion.cpp:4693 - start sampling
[DEBUG] stable-diffusion.cpp:2366 - diffusion compute buffer size: 115.28 MB
[INFO]  stable-diffusion.cpp:4511 - sampling using LCM method
  |==================================================| 26/26 - 4.33s/it
[INFO]  stable-diffusion.cpp:4698 - sampling completed, taking 218.30s
[INFO]  stable-diffusion.cpp:4685 - generating image: 3/10
[INFO]  stable-diffusion.cpp:4693 - start sampling
[DEBUG] stable-diffusion.cpp:2366 - diffusion compute buffer size: 115.28 MB
[INFO]  stable-diffusion.cpp:4511 - sampling using LCM method
  |==================================================| 26/26 - 4.19s/it
[INFO]  stable-diffusion.cpp:4698 - sampling completed, taking 329.84s
[INFO]  stable-diffusion.cpp:4685 - generating image: 4/10
[INFO]  stable-diffusion.cpp:4693 - start sampling
[DEBUG] stable-diffusion.cpp:2366 - diffusion compute buffer size: 115.28 MB
[INFO]  stable-diffusion.cpp:4511 - sampling using LCM method
  |==================================================| 26/26 - 4.11s/it
[INFO]  stable-diffusion.cpp:4698 - sampling completed, taking 438.95s
[INFO]  stable-diffusion.cpp:4685 - generating image: 5/10
[INFO]  stable-diffusion.cpp:4693 - start sampling

@Jonathhhan commented Nov 23, 2023

One batch generation works well for me, but if I start a second batch (also with batch size 1) the app crashes and I get this error message:

[DEBUG] stable-diffusion.cpp:1101 - parse 'a lovely cat, van gogh' to [['a lovely cat, van gogh', 1], ]
[DEBUG] stable-diffusion.cpp:503  - split prompt "a lovely cat, van gogh" to tokens ["a</w>", "lovely</w>", "cat</w>", ",</w>", "van</w>", "gogh</w>", ]
[DEBUG] stable-diffusion.cpp:1034 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4049 - computing condition graph completed, taking 41 ms
[DEBUG] stable-diffusion.cpp:1101 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:503  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1034 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4049 - computing condition graph completed, taking 38 ms
[INFO]  stable-diffusion.cpp:4673 - get_learned_condition completed, taking 81 ms
[INFO]  stable-diffusion.cpp:4685 - generating image: 1/4
[INFO]  stable-diffusion.cpp:4693 - start sampling
GGML_ASSERT: /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml.c:4065: ggml_can_mul_mat(a, b)

CUDA error 4 at /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:8550: driver shutting down
current device: 0
Segmentation fault
make: *** [../../../libs/openFrameworksCompiled/project/makefileCommon/compile.project.mk:191: RunRelease] Error 139

Anyway, it only happens with my openFrameworks implementation (but on both Linux and Windows)...

@Jonathhhan commented Nov 23, 2023

libstable-diffusion.so points to the path where libggml.so was created.
It's possible to change that to a relative path with set_target_properties in this file https://github.com/FSSRepo/stable-diffusion.cpp/blob/master/CMakeLists.txt (not sure if there are any drawbacks or if there is a better way to do it):

add_library(${SD_LIB} stable-diffusion.h stable-diffusion.cpp)
set_target_properties(${SD_LIB} PROPERTIES
        BUILD_WITH_INSTALL_RPATH FALSE
        LINK_FLAGS "-Wl,-rpath,$ORIGIN/")
target_link_libraries(${SD_LIB} PUBLIC ggml)
target_include_directories(${SD_LIB} PUBLIC .)
target_compile_features(${SD_LIB} PUBLIC cxx_std_11)
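
As a sanity check (assuming a Linux build), the runtime search path embedded in the resulting library can be inspected with readelf:

    readelf -d libstable-diffusion.so | grep -i rpath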

@FSSRepo (Contributor, Author) commented Nov 23, 2023

@Jonathhhan I don't notice any change when adding this part; it continues to generate stable-diffusion.lib in the same directory.

set_target_properties(${SD_LIB} PROPERTIES
        BUILD_WITH_INSTALL_RPATH FALSE
        LINK_FLAGS "-Wl,-rpath,$ORIGIN/")

@Jonathhhan commented Nov 23, 2023

@FSSRepo yes, it is still generated in the same directory. The difference is that with the change, stable-diffusion.so links to the ggml.so in its own folder (wherever that is), while before it was linked to the ggml.so in the folder where it was created.

@FSSRepo (Contributor, Author) commented Nov 23, 2023

> @FSSRepo yes, it is still generated in the same directory. The difference is that with the change, stable-diffusion.so links to the ggml.so in its own folder (wherever that is), while before it was linked to the ggml.so in the folder where it was created.

Does this fix your problem?

GGML_ASSERT: /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml.c:4065: ggml_can_mul_mat(a, b)

CUDA error 4 at /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:8550: driver shutting down
current device: 0
Segmentation fault
make: *** [../../../libs/openFrameworksCompiled/project/makefileCommon/compile.project.mk:191: RunRelease] Error 139

@Jonathhhan commented Nov 23, 2023

@FSSRepo no, it's a different one. Maybe it happens (really no idea) because the buffers/contexts are freed after each batch, when they should only be freed when the program is closed, or recreated when a new model is loaded?

Edit: it works if I do not free the context at the end of StableDiffusion::txt2img / StableDiffusion::img2img:
// ggml_free(work_ctx);

@Jonathhhan

With StableDiffusion::img2img I get:

[INFO]  stable-diffusion.cpp:4813 - sampling using modified DPM++ (2M) method
GGML_ASSERT: /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml.c:5253: a->ne[2] == b->ne[2]

CUDA error 4 at /mnt/c/Users/Jonat/Desktop/cp/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:8550: driver shutting down
current device: 0
Segmentation fault

@FSSRepo (Contributor, Author) commented Nov 23, 2023

@Jonathhhan check now

@Jonathhhan commented Nov 23, 2023

@FSSRepo it works. Thanks.

@paulocoutinhox

Hi, one question: does it work on iOS?

@FSSRepo (Contributor, Author) commented Nov 25, 2023

@paulocoutinhox Yes, it can be done, but it requires creating the missing kernels for the Metal backend, based on the ones I have created for cuBLAS. I don't have a Mac or an iPhone, and I don't know much about Metal programming, so I am the least qualified to contribute there.

@paulocoutinhox

Nice. But if I run it today on an iPhone, what happens? Will it run on the CPU?

@Green-Sky (Contributor)

> Nice. But if I run it today on an iPhone, what happens? Will it run on the CPU?

Yes.

@leejet (Owner) commented Nov 26, 2023

I've tested it on Windows/Linux/macOS, and now it's working perfectly. The PR is ready for merging. Thank you for your excellent work!

@leejet merged commit 8124588 into leejet:master on Nov 26, 2023. 7 checks passed.