Why is the feed_prompt process so slow? #439
Hey there! Thanks for reporting this and providing lots of detail :) The issue here is that the version of GGML we use doesn’t support a specific operation required for feeding more than one token at a time with Metal (i.e. this works fine with CUDA, but not Metal). See also #403. This has been fixed in upstream GGML/llama.cpp, but we haven’t integrated that fix yet. The work has started in #428 and that should hopefully be finished within the next week (I’m out of town but I hope to get back to it soon). Hope that helps clarify the state of affairs!
I'm very happy to hear this news and am looking forward to the merged version. Thank you for your work. May I keep this issue open until after the release?
Hello @philpax, has there been any recent movement on this?
I started working on it, but realised that it would end up being quite a large task. Still working on it, but it'll take some time.
thanks |
`llm` is indeed a fantastic library and very easy to use. However, after using it for a few days, I noticed that the `feed_prompt` process is always very slow. It consumes a significant amount of CPU resources and doesn't utilize the GPU (I found in the hardware acceleration documentation that `feed_prompt` currently doesn't use GPU resources). As a result, if I add some context during the conversation, it takes a long time for `feed_prompt` to complete, which is not ideal for the actual user experience. I used TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin for testing.

Using the same model and prompt, I tested with `llama.cpp`, and its first-token response time is very fast. I'm not sure what the difference is in the `feed_prompt` process between `llm` and `llama.cpp`. Judging by the CPU and GPU history, it seems like `llama.cpp` is fully utilizing the GPU for inference. Can you please help me identify what's wrong?
Model: TheBloke/Llama-2-7B-Chat-GGML / llama-2-7b-chat.ggmlv3.q2_K.bin
System:
llama.cpp command:
llama.cpp Result:
llm sample code:
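My test roughly follows the standard pattern from the `llm` README: load the model, start a session, and run inference, which (as far as I understand) feeds the prompt through `feed_prompt` internally before sampling tokens. The sketch below is a minimal version of that pattern rather than my exact code; the model path, prompt, and parameter values are placeholders, and exact type and function names may differ between `llm` versions.

```rust
use std::io::Write;

fn main() {
    // Load the GGML model from disk (placeholder path).
    let llama = llm::load::<llm::models::Llama>(
        std::path::Path::new("/path/to/llama-2-7b-chat.ggmlv3.q2_K.bin"),
        llm::TokenizerSource::Embedded,
        // Model parameters; GPU offload is configured here per the
        // hardware acceleration docs (left at defaults in this sketch).
        Default::default(),
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));

    // Start a session and run inference; the prompt is fed first
    // (this is the slow feed_prompt phase), then tokens are sampled.
    let mut session = llama.start_session(Default::default());
    let res = session.infer::<std::convert::Infallible>(
        &llama,
        &mut rand::thread_rng(),
        &llm::InferenceRequest {
            prompt: "Some long chat context goes here...".into(),
            parameters: &llm::InferenceParameters::default(),
            play_back_previous_tokens: false,
            maximum_token_count: None,
        },
        &mut Default::default(),
        // Stream tokens to stdout as they are produced.
        |r| match r {
            llm::InferenceResponse::PromptToken(t)
            | llm::InferenceResponse::InferredToken(t) => {
                print!("{t}");
                std::io::stdout().flush().unwrap();
                Ok(llm::InferenceFeedback::Continue)
            }
            _ => Ok(llm::InferenceFeedback::Continue),
        },
    );

    match res {
        Ok(stats) => println!("\n\nInference stats:\n{stats}"),
        Err(err) => println!("\n{err}"),
    }
}
```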
llm sample code result: