Misc. bug: The cache_prompt parameter not working properly #10993

Open
feikiss opened this issue Dec 27, 2024 · 0 comments


feikiss commented Dec 27, 2024

Name and Version

version: 4149 (1bb30bf)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

The param "cache_prompt" seems not working as expected.
I'm using the llama.cpp in pure CPU env. Command to start the server: /llama-server -m /models/qwen2.5-1.5b-q8/qwen2.5-1.5b-instruct-q8_0.gguf -c 1024 --host 0.0.0.0 --port 8000 -dkvc --metrics --file /models/promts/tool.txt --keep -1
The URI I'm using: /v1/completions , body is as below:

{
  "prompt": "LONG_PROMPT + short_question1",
  "cache_prompt": true,
  ...  // other params
}

If I invoke the API with a prompt of the form "LONG_PROMPT" + short_question1 again and again, it always works well and replies quickly. But if I send a different prompt without the long prefix, e.g. just "short_question2", and then go back to "LONG_PROMPT" + short_question1, the request takes a very long time again. It seems the prompt cache for "LONG_PROMPT" is lost as soon as I send a prompt that does not contain "LONG_PROMPT".
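For reference, here is a minimal sketch (Python, standard library only) of the request pattern described above. The server URL, the placeholder prompt strings, and the max_tokens value are assumptions for illustration, not taken from the actual setup:

import json
import time
import urllib.request

# Assumed values for illustration only; adjust to the real setup.
SERVER = "http://localhost:8000"                       # matches --port 8000 above
LONG_PROMPT = "...(the long prompt from tool.txt)..."  # placeholder

def complete(prompt: str) -> float:
    """POST to /v1/completions with cache_prompt enabled; return latency in seconds."""
    body = json.dumps({
        "prompt": prompt,
        "cache_prompt": True,
        "max_tokens": 32,   # assumed small limit so timing mostly reflects prompt processing
    }).encode("utf-8")
    req = urllib.request.Request(
        SERVER + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.time()
    urllib.request.urlopen(req).read()
    return time.time() - start

# Repeated calls with the same long prefix are fast after the first one.
print("long+q1 (cold):", complete(LONG_PROMPT + " short_question1"))
print("long+q1 (warm):", complete(LONG_PROMPT + " short_question1"))

# One unrelated prompt without the long prefix...
print("q2 only       :", complete("short_question2"))

# ...after which the long prefix appears to be processed from scratch again.
print("long+q1 again :", complete(LONG_PROMPT + " short_question1"))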

First Bad Commit

No response

Relevant log output

No response
