Name and Version
version: 4149 (1bb30bf)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
The param "cache_prompt" seems not working as expected.
I'm running llama.cpp in a CPU-only environment. Command used to start the server:
/llama-server -m /models/qwen2.5-1.5b-q8/qwen2.5-1.5b-instruct-q8_0.gguf -c 1024 --host 0.0.0.0 --port 8000 -dkvc --metrics --file /models/promts/tool.txt --keep -1
The endpoint I'm calling is /v1/completions, with a request body like this:
{
  "prompt": "LONG_PROMPT + short_question1",
  "cache_prompt": true,
  ... // other params
}
If I invoke the API with prompts of the form "LONG_PROMPT" + short_question1 again and again, it always works well and replies quickly. But if I send a different prompt, such as "short_question2" on its own, and then go back to "LONG_PROMPT" + short_question1, the request takes much longer. It seems the prompt cache for "LONG_PROMPT" is lost whenever I send a prompt that does not contain "LONG_PROMPT".
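For reference, a minimal sketch of the call pattern that reproduces the slowdown (assumptions: the server is started with the command above and reachable on localhost:8000; LONG_PROMPT and max_tokens are illustrative placeholders, not the real values):

```python
import time
import requests

URL = "http://localhost:8000/v1/completions"  # assumed host/port from the command above
LONG_PROMPT = "<contents of /models/promts/tool.txt>"  # placeholder for the real long prompt

def ask(prompt: str) -> None:
    # Send one completion request with prompt caching enabled and print its latency.
    t0 = time.time()
    r = requests.post(URL, json={
        "prompt": prompt,
        "cache_prompt": True,
        "max_tokens": 32,  # illustrative; any small value works
    })
    r.raise_for_status()
    print(f"{time.time() - t0:.2f}s")

ask(LONG_PROMPT + "short_question1")  # slow: cache is cold
ask(LONG_PROMPT + "short_question1")  # fast: long prefix reused from cache
ask("short_question2")                # unrelated prompt without the long prefix
ask(LONG_PROMPT + "short_question1")  # slow again: long prefix re-evaluated
```

The fourth call is the one that becomes slow again, even though its prompt is identical to the first two.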
First Bad Commit
No response
Relevant log output
No response