server : improvements and maintenance #4216
Comments
Would love if the server could get look ahead decoding and contrastive search. A collection of common presets would be very helpful for fast model evaluation. The ability to edit responses and replies in the UI would be very useful for rapidly testing prompt branches if combined with batching capabilities. Would also appreciate a simple implementation of request queuing and a server interface for the model training example. Edit: Discussion link for contrastive search : #3450 , other related topics / potential substitutes are mentioned in the thread. |
Thanks for raising this issue and looking into the server example. I think this #4201 could be relevant - although it sounds like the fix will be in the core code rather than in the server. Since the addition of support for batching, llama.cpp could become a viable competitor to vllm for large scale deployments. This is also helpful for individual hobbyists who are using/building AI agents (because these possibly make multiple requests in parallel to the LLMs to construct answers). So I think your suggestions around improving stability/refactor of the server example would be very valuable. Also focusing on the throughput speed particularly of batched requests (and benchmarking this against vllm). |
What'd be lovely is to see also the speculative sampling added to it - would be really a great addition there |
Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo. There are 100s of libraries and tools that integrate different subsets of backends and inference libraries, especially in the Python world. This doesn't make sense. We need a simple convention by which everything can interop. The solution is to use OpenAI's API as a protocol on localhost. Could there be better standards? Maybe. But this is the one we have, and it works really well. My suggestion is that we clean up the server and treat it and the This means that existing code only needs the api_url override to be modified to work locally.
This works already, at least as long as you are loading a model that conforms to chatml and are ok with the default context size. I find that a much better vision for how LLM interop will work in the open source space. Different servers, different backends, all on the same proto. |
This option to generate multiple alternatives for the same prompt requires the ability to change the seed, and the truth is, I've been having a bit of a struggle with it when adding parallel decoding, as it raises questions about how the seed should be managed. |
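One possible answer to the seed question, sketched here purely as an illustration: derive one RNG per alternative from the single seed the client sends, so that each alternative is reproducible but differs from its siblings. The function name `make_slot_rngs` and the "seed plus index" rule are assumptions made for this sketch, not how the server handles seeds.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: builds one RNG per requested alternative.
// Alternative i is seeded with request_seed + i (unsigned arithmetic wraps),
// so the same request seed always reproduces the same set of alternatives.
std::vector<std::mt19937> make_slot_rngs(std::uint32_t request_seed, int n_alternatives) {
    assert(n_alternatives > 0);
    std::vector<std::mt19937> rngs;
    rngs.reserve(static_cast<std::size_t>(n_alternatives));
    for (int i = 0; i < n_alternatives; ++i) {
        rngs.emplace_back(request_seed + static_cast<std::uint32_t>(i));
    }
    return rngs;
}
```

With this scheme the sampling of one slot does not depend on how many other slots are decoding at the same time, which is the property parallel decoding makes hard to keep with a single shared RNG.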
LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project. |
The OAI API included with the server, is great I love it. These params are much needed. Thanks. |
I think it would be good if the OAI endpoint supported the same set of parameters and defaults as the regular endpoint and sensible or argument driven defaults given many clients won't supply all parameters. One issue is the seed is defaulting |
With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily. This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi. |
Sorry to jump-in in OT, but you are not sacrificing any speed nor capabilities with LocalAI - at the end the engine is always the same (llama.cpp, or vllm, or you name it) - however I see the value of having a server in llama.cpp. It's people's choice at the end of what suits better their needs. And also, the server LocalAI implementation is heavily based on that ;)
For production there are quite some issues that are blockish-imho rather than this. Had several bugs in LocalAI w/ llama.cpp which makes it still difficult to navigate into that direction, which I hope gets addressed with this ticket. Things like #3969 are quite scary for prod-users. |
Just a thought as a user of llama.cpp server: I imagine it's quite common for the llama.cpp Server to be used by developers who are able to add non core functionality in their own code. (e.g. Devs create their own application or library or REST server that wraps/orchestrates llama.cpp). Naturally the llama.cpp server is very convenient for this and works with any programming language. It also has a smaller/self contained API to learn. I think some of the following can be done in dev's own code outside of llama.cpp:
(Disclaimer: These are just examples, I haven't fully evaluated the pros/cons of implementing them outside of llama.cpp) It's excellent if this project has the mission and bandwidth to provide functionalities like these. But if it sounds like its becoming too much work or feature creep then I imagine focusing on the bits that are impossible to do outside of llama.cpp is one of the ways to prioritise. |
Hi, @ggerganov .The vllm project has a PR under construction for a chat template that can be used as a reference. vllm-project/vllm#1756 |
Regarding chat templates: I see they are using something called Jinja. We are not going to implement Jinja in The best we can do are 2 things:
|
@ggerganov If you are going to hard code templates, this server will be totally unusable for a large number of users. I am experimenting with new templates, and would really rather the models trained with them be widely supported. Hell, there are so many variations of the I mentioned on the other ticket that there is: https://github.com/jinja2cpp/Jinja2Cpp Maybe that can be an optional component to add support for chat templates from the tokenizer, and hard coding can be the default code-path, I understand not wanting to add additional dependencies. Getting the jinja string in the client is not helpful as an API endpoint, unless there is a client side compatibility layer between the I had opened a issue for chat template support a while ago, when I started working on it for vLLM: #3810 I implemented this for vLLM, and after going through a few rounds of testing, I had to rework things up and add additional parameters, and cli arguments to support the API properly. Here is the diff for my The important points from the vLLM pull request:
The request.echo is an extension of the api, due to the nature of Open Source LLMs being able to finish the last role:content pair in the messages list if request.add_generation_prompt=false (which is also an extension of the API due to the need to support this HF feature) and the template/model support that feature. We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly. |
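To make the add_generation_prompt behaviour concrete, here is a minimal sketch for a ChatML-style template. It assumes a template that leaves the final message open when add_generation_prompt is false (not every template does this), and `ChatMessage` is a hypothetical stand-in for the JSON message objects used by the actual APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical message type for illustration only.
struct ChatMessage {
    std::string role;
    std::string content;
};

// Formats messages as ChatML.
// add_generation_prompt = true (the default, matching the OpenAI API):
//   every message is closed and an empty assistant turn is opened.
// add_generation_prompt = false:
//   the last message is left unterminated so the model can continue it.
std::string format_chatml(const std::vector<ChatMessage> & messages,
                          bool add_generation_prompt = true) {
    // Continuing a message only makes sense if there is one to continue.
    assert(add_generation_prompt || !messages.empty());
    std::string out;
    for (std::size_t i = 0; i < messages.size(); ++i) {
        const bool is_last = (i + 1 == messages.size());
        out += "<|im_start|>" + messages[i].role + "\n" + messages[i].content;
        if (add_generation_prompt || !is_last) {
            out += "<|im_end|>\n";
        }
    }
    if (add_generation_prompt) {
        out += "<|im_start|>assistant\n";
    }
    return out;
}
```

An echo option would then decide whether the returned text includes the partial last message that was passed in or only the newly generated continuation.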
my personal thoughts here, but probably C++ ain't the best language for that - templating is quite easy to implement in scripted languages rather than C++, and in my opinion would undermine the maintenance and flexibility to have a lean Just my 2c, but maybe templating fits better on top of llama-cpp-python - which might be easier to go and to maintain (while keeping the core small and extensible)? |
All templates that I've seen so far are so basic that I don't understand why we need an entire scripting language to express them. Is there a more advanced use case other than a basic How many templates do we expect to ever have? 10s, 100s? Even if it is 1000s, I prefer to have them hardcoded instead of building Here is sample ChatML template in a few lines of C++ that we currently use (and this is not even the best way to do it):

    std::string format_chatml(std::vector<json> messages)
    {
        std::ostringstream chatml_msgs;
        for (auto it = messages.begin(); it != messages.end(); ++it) {
            chatml_msgs << "<|im_start|>"
                        << json_value(*it, "role", std::string("user")) << '\n';
            chatml_msgs << json_value(*it, "content", std::string(""))
                        << "<|im_end|>\n";
        }
        chatml_msgs << "<|im_start|>assistant" << '\n';
        return chatml_msgs.str();
    }

I could be missing something, but for the moment I don't see a good reason to add Jinja support. Let's see how it goes - I'm open to reconsider, but need to see some reasonable examples and use cases that justify this dependency.
I think I understand the Yes, I agree. |
The fact is, if the rest of the ecosystem standardizes on these templates being "the way" to format messages, it will proliferate to new and unexpected use cases.
Here is an example call using my inkbot template which uses echo:
Which returns:
vs with
Since the official OpenAI API for Edit: |
In my opinion, most of these projects based on ggml have the characteristic of being very lightweight with few dependencies (headers library: httplib.h json.hpp stb_image.h and others), making them portable compared to having to download a 2 GB library like PyTorch and the entire Python environment that downloads packages that will never be used. Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project. |
Absolutely no one is advocating for a whole pytorch dependency chain. There just may be other options for running the jinja that don't bloat the dependency chain too badly, and I very much think it's worth discussing further to see if there is an acceptable solution that can be found. Even if it's something like transpiling jinja to another language that we can directly run, or providing hooks for users to run a python interpreter and the jinja dependency to give the results back to the main cpp program. That way it can be optional, and fall back to hard coded options if unavailable. Just some thoughts, take them for what you will, I am not a cpp dev. |
I would suggest something like creating a small utility that performs the functionality we are interested in using C++ (porting it). Analyzing the Jinja2cpp library quickly, it has Boost as a dependency, which explains the long CMake configuration time. It could be beneficial to decouple that library and include only the necessary functions for Jinja2cpp to work, making it more lightweight. |
@tobi completely agree that server.cpp should be a first-class focus of this repo. My macOS app uses exactly the architecture you describe, hitting I also wanted to throw in an example of some ugly code I'd love to kill with built-in server.cpp templating. I'm guessing every server.cpp client has some version of this and I'm sure they all have slightly different bugs: https://github.com/psugihara/FreeChat/blob/main/mac/FreeChat/Models/NPC/PromptTemplates/Templates.swift @Tostino After understanding more of the background here, I agree that ideally we'd want to support the jinja templates included in GGUFs. I didn't even know these were added to GGUF, that's so cool! Unfortunately I'm not seeing a ton of existing work in cpp besides the relatively heavyweight jinja2cpp you found as well. Implementing a minimal jinja2 parser seems out of scope for v1 of template support but perhaps a more incremental compromise could work...
I agree with @ggerganov that the templates are pretty trivial to implement in c++ or whatever and I'd first and foremost just like to have them all in one place (ideally llama.cpp) rather than bespoke implementations in each client. A mapping from jinja template hashes to c++ functions would be the most performant setup too, even if it's a bit ugly conceptually. If templates are added here, I can delete my implementation in FreeChat so we'll have net 0 fragmentation :) |
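A rough sketch of the "template hash to C++ function" idea, under several assumptions: `TemplateRegistry`, `Formatter` and `ChatMessage` are names invented for this example, and `std::hash` is used only for brevity (its values are not stable across standard libraries, so a real table would need a fixed hash function).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical message type for illustration only.
struct ChatMessage {
    std::string role;
    std::string content;
};

using Formatter = std::function<std::string(const std::vector<ChatMessage> &)>;

// Hard-coded ChatML formatter, mirroring the snippet earlier in the thread.
std::string format_chatml(const std::vector<ChatMessage> & messages) {
    std::string out;
    for (const auto & m : messages) {
        out += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    out += "<|im_start|>assistant\n";
    return out;
}

// Maps the hash of a Jinja template string (as stored in the model file)
// to a hand-written formatter for that template.
class TemplateRegistry {
public:
    void add(const std::string & jinja_source, Formatter fn) {
        assert(fn != nullptr); // an empty formatter would fail at call time
        table_[std::hash<std::string>{}(jinja_source)] = std::move(fn);
    }

    // Returns nullptr when no hard-coded formatter is known for the template,
    // so the caller can fall back to a default or report an error.
    const Formatter * find(const std::string & jinja_source) const {
        const auto it = table_.find(std::hash<std::string>{}(jinja_source));
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::size_t, Formatter> table_;
};
```

The lookup never evaluates Jinja: a template that is not byte-for-byte identical to a registered one simply misses, which is the trade-off of this approach.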
That isn't possible. You can template your response on the client side, but then you need to hit the legacy |
For my use-case that would be fine. Though it does look like there are some non-standard args supported by |
@ggerganov Is there any way to do the following with llamacpp? Sorry for my poor use of Paint, but I wanted to convey my idea for improving the way we handle requests from different clients more efficiently and conveniently, at least for applications like chatbots. When a client sends a PDF document for processing, other clients shouldn't get stuck, and should continue to receive tokens continuously. |
@FSSRepo to elaborate on what I mentioned before. The way you can do this is by splitting all the work into small batches. If the server receives a 2048 token prompt from one client, you don't process all of it in a single evaluation. For example evaluate the first 512 tokens only. If it receives another request from another client for 256 tokens in the meanwhile, then in the next batch evaluate 256 tokens from the first client and 256 from the new client. And so on until all the requests have been processed, with every new batch keep distributing fairly all the work queued from the different clients. |
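The splitting described above can be sketched as a small scheduler that divides each batch's token budget evenly among the requests that still have prompt tokens waiting. The types `PendingRequest` and `BatchShare` and the function `next_batch` are hypothetical names for this illustration; they are not part of the llama.cpp API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-client state: prompt tokens still waiting to be evaluated.
struct PendingRequest {
    int client_id;
    int tokens_left;
};

// One entry of a batch: evaluate n_tokens tokens belonging to client_id.
struct BatchShare {
    int client_id;
    int n_tokens;
};

// Builds the next batch of at most batch_size tokens. The budget is divided
// evenly among requests with work left; budget that a small request does not
// need is handed out again in the next round of the loop.
std::vector<BatchShare> next_batch(std::vector<PendingRequest> & pending, int batch_size) {
    assert(batch_size > 0);
    std::vector<BatchShare> batch;
    int budget = batch_size;
    while (budget > 0) {
        const int active = static_cast<int>(std::count_if(
            pending.begin(), pending.end(),
            [](const PendingRequest & r) { return r.tokens_left > 0; }));
        if (active == 0) {
            break; // nothing left to schedule
        }
        const int share = std::max(1, budget / active);
        for (auto & r : pending) {
            if (r.tokens_left == 0 || budget == 0) {
                continue;
            }
            const int n = std::min({share, r.tokens_left, budget});
            r.tokens_left -= n;
            budget -= n;
            // Merge with an existing entry for the same client, if any.
            const auto it = std::find_if(batch.begin(), batch.end(),
                [&](const BatchShare & s) { return s.client_id == r.client_id; });
            if (it == batch.end()) {
                batch.push_back({r.client_id, n});
            } else {
                it->n_tokens += n;
            }
        }
    }
    return batch;
}
```

With a batch size of 512, a lone 2048-token request gets 512 tokens in the first batch; once a second client arrives with 256 tokens, the next batch carries 256 tokens from each, matching the example above.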
Yes, what @slaren said. The API is flexible and server implements just one possible way of batching the inputs |
Maybe we can divide the |
Server HTTP API CORS is broken #6544 . |
I put together a PR with some themes and an example of how people can skin/re-theme it. Anyone interested in approving? Note that graphic design is NOT my specialty. |
I noticed that the OpenAI semi-compatible API defaults the temperature to 0, but this is different than OpenAI's actual API which defaults to 1. This can make applications that assume a non-0 value default (such as Oatmeal) much more frustrating to use. Curious if there's a reason the default differs? If it's worth changing this, I've already set up a PR: #7226 |
I've thought of something that I can't find if it has been suggested already as not sure what words best describe it:
It would be very helpful if we could somehow send a set of overrides for the parameters to be sent via the API (possibly using some JSON file as input). I wanted originally to use the This would also solve the person above's problem of the temperature being default of zero too and let them override it. I think it would be pretty straightforward to implement as you could just have a JSON file as input that follows the exact same format as the API uses, and then use it to set new default values. It's also possible this CLI option could be adapted:
as it already loads and parses a JSON file for a few parameters.... If anybody is interested then I can probably take a look over the weekend at adding this? It could work in 2 ways too:
|
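As a sketch of the "override file sets new defaults" idea: parameters could be resolved in three layers, with built-in defaults at the bottom, the operator's override file in the middle, and the client's request on top. `Params` and `resolve_params` are hypothetical names, a flat map of numbers stands in for the JSON objects, and the values in the usage note are illustrative only.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical flat parameter set standing in for a JSON object.
using Params = std::map<std::string, double>;

// Resolution order, lowest to highest priority:
//   built-in defaults < operator override file < client request.
Params resolve_params(const Params & builtin,
                      const Params & overrides,
                      const Params & request) {
    Params out = builtin;
    for (const auto & kv : overrides) {
        // Assumption: an override file may only name parameters the server knows.
        assert(builtin.count(kv.first) == 1);
        out[kv.first] = kv.second;
    }
    for (const auto & kv : request) {
        out[kv.first] = kv.second;
    }
    return out;
}
```

For example, with a built-in temperature of 0 and an override file that sets temperature to 1, a request that omits temperature resolves to 1, while a request that sends 0.7 still gets 0.7.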
well, i suppose it's a breaking change and not really germane to the server per se, but i propose the leadership consider the project could be renamed something other than "llama" because that's one model from one company and this framework can surely be useful for other models besides llama from meta right? i also think it's distasteful but impossible to avoid the implications of the way the following rules and prohibitions on learning exist 100% to stifle innovation in the AI industry: https://ai.meta.com/llama/license/ """
... non-offensive parts omitted for brevity ... c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials. What makes us confident it's a good idea to tie the name of this project to the name of meta's open weights but not OSI approved license llm? e.g. for future options to use other AI models, are we sure we ought to sink hundreds of engineer years into a project named after a corpo ai model with explicitly monopolistic business terms? i don't mean to mount a crusade here, just seems reasonable to rename llama.cpp to something else which sounds useful for more than just talking to llama (which produces outputs which we can't even use for other future AI !) and post this here because this is where the feedback link leads; bottom line for me personally is, I wouldn't ever use this for Llama but i'd consider it for Mistral |
Sorry to say it, but I think this would be a terrible idea and cause lots of confusion. Think of it like "Hoover" or "Biro" where the default make/manufacturer became the colloquial term. |
That's fair, I realize momentum behind current naming makes renaming a nightmare. However, from a legal perspective (IANAL) if their license isn't compatible, then it's a big concern for long term use. I wrote that fully realizing it probably wouldn't happen. please know I don't like this topic. I didn't write the llama license. Seems license incompatibility of Llama and LLM Compiler with OSI-approved projects is a big concern. even if nobody reads the fine print or gives a crap, the fact users could get rug pulled / sued / whatever by a megacorp like meta for using llama outputs or the llama name, makes these things exponentially less valuable, even dangerous to use. My suggestion is to be careful, take the legal stuff seriously, and consult a real lawyer about license / trademark compatibility issues, and pester meta to relicense their models so folks don't need to worry if they can share the output or use it for work stuff. |
fyi, my prior concern was fixed amidst the release of llama 3.1 |
Any plan to support parallel prompt evaluation? Or am I getting this wrong? There is discussion #3363 that talks about it, but it's a little old. Right now, if there are 3 slots and two running requests at the token generation step, sending a new request will kinda "freeze" the other requests (depending on pp batch size). Edit: the batched example uses the same prompt as input, so yeah ... this is not exactly what we need. |
This is the expected behaviour. You can reduce the freezing effect by using a smaller batch size, but this can negatively affect the overall performance. Some dynamic batch size adjustments could be easily implemented though to improve certain use cases. |
Thanks for the answer. |
The server example has been growing in functionality and unfortunately I feel it is not very stable at the moment and there are some important features that are still missing. Creating this issue to keep track of some of these points and try to draw more attention from the community. I guess some of the tasks are relatively big and would require significant effort to complete.
Support chat templates
We need to have separation between the user input and the special tokens, so that the tokenization is performed correctly. See the following comments / commits for more context:
Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160 (comment)
c544fae
Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160 (comment)
We already support extracting meta information from the GGUF model files that can provide the chat template for the specific model:
gguf-py : export chat templates #4125
Support chat template for /v1/chat/completions: Server: use llama_chat_apply_template #5593
List of supported templates: view on wiki
Supporting this in server would require changes both in the backend and the frontend
Likely redundant logic for OpenAI (OAI) compatibility that should be removed
server : OAI API compatibility #4198 (comment)
Use multiple mount points for the OAI API
llama.cpp/examples/server/server.cpp
Lines 2682 to 2684 in af19d35
Add "/chat/completions" as alias for "/v1/chat/completions" #5722
Return meaningful errors on KV cache overflow
update_slots : failed to decode the batch #4185 (comment)
Refactor the code
With the recent additions for parallel decoding support for multiple clients and LLaVA, I feel the code base became very cumbersome and there is a lot of room for refactoring and improving the code. There should be some effort dedicated to cleaning up things and simplifying the code.
Server: try to refactor server.cpp #5065
Server: Improve work queue stability #5710
Batched decoding endpoint?
Although we added parallel decoding support via "slots", we are still lacking batched decoding where a single client could pass an array of prompts to be completed. Or alternatively, generate multiple completions for a single prompt. Would be useful to support this use case
llama : add batched inference endpoint to server #3478 (comment)
Tool calls (function calling)
Support for MeetKai/functionary model by implementing OpenAI-compatible tool calls to chat endpoint.
Server: add support for "tool_calls" (MeetKai/functionary model) #5695
Multimodal support
Support has been temporarily dropped in server : refactor #5882. Before working in server, we should improve llava-cli and the API for using LLaVA
Prompt processing improvement
Server production readiness
This is likely not a complete list of things - if you think some feature is important to be improved or supported, drop a comment.
Have a look at issues labelled with server/webui.