An OpenAI Compatible Web Server for llama.cpp #795
-
@abetlen thanks a lot for your Python wrapper. Just to add some additional settings for others: assuming 192.168.0.1 as the server IP, the settings when using the original OpenAI library, and for chatbot-ui, are sketched below.
Note: there is no trailing '/'.
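A minimal sketch of those settings, assuming the server runs on its default port 8000 (host and port are assumptions):

```python
import openai

# Point the official OpenAI client at the llama-cpp-python server.
openai.api_base = "http://192.168.0.1:8000/v1"  # no trailing slash
openai.api_key = "sk-anything"  # placeholder; the server does not validate it by default
```

For chatbot-ui, the equivalent would be the container environment variable `OPENAI_API_HOST=http://192.168.0.1:8000` (again with no trailing slash), matching the flag mentioned in the announcement below.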
-
Hello, where do I specify the prompt template?
-
@abetlen Your code cannot be run: `\llama_cpp\llama.py", line 1435, in __del__`
-
@abetlen thank you, it worked! How do I specify the prompt template and parameters, though?
-
What is going on here: `no matches found: llama-cpp-python[server]`?
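(For anyone who hits the same error: `no matches found` is zsh expanding the square brackets as a glob pattern; quoting the extras spec, e.g. `pip install 'llama-cpp-python[server]'`, is the usual fix.)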
-
Could you give an example of using the openai library to call the web server?
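A minimal sketch, assuming the server is running locally on the default port 8000 and using the pre-1.0 openai package (which has the `openai.api_base` attribute mentioned in the announcement below):

```python
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed host/port
openai.api_key = "sk-placeholder"  # any non-empty string; not validated by default

response = openai.ChatCompletion.create(
    model="local-model",  # placeholder; the server answers with whatever model it was started with
    messages=[{"role": "user", "content": "Name the planets in the solar system."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```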
-
Can we add grammar support (#1773) to the llama-cpp-python web server? Currently that option is not there.
-
After installing llama-cpp-python, I can create a container using the following command:
When accessing chatbot-ui in the browser, it always prompts that there is no OpenAI API key. What should I do? Where do I find the API key? Thanks.
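(A commonly suggested workaround, assuming a default server setup: the llama-cpp-python server does not validate the API key by default, so any non-empty placeholder satisfies chatbot-ui, e.g. passing `-e OPENAI_API_KEY=sk-dummy` when starting the chatbot-ui container.)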
-
Add GPU Support for Server

To add Nvidia GPU support:

```bash
# Install Server with OpenAI Compatible API - with CUDA GPU support
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]

# Run the Server using a LLaMA-2 7B 5-bit quantized model
python3 -m llama_cpp.server \
  --model ./models/llama-2-7b-chat.Q5_K_M.gguf \
  --host localhost \
  --n_gpu_layers 32
```

Simple Command Line Chatbot

Here is a simple Python CLI chatbot for the server: chat.py.

Tested on an Ubuntu Linux host with an Intel i5-6500 CPU @ 3.20GHz, 8GB RAM, and an Nvidia GTX 1060 GPU with 6GB VRAM: approx. 12 tokens/second.
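The linked chat.py is not reproduced here; below is a minimal sketch of such a CLI chatbot (my own illustration, not the linked script), using the pre-1.0 openai package and assuming the server started above is reachable on localhost:8000:

```python
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed host/port
openai.api_key = "sk-placeholder"  # not validated by the server by default

history = [{"role": "system", "content": "You are a helpful assistant."}]
while True:
    user = input("You: ").strip()
    if user.lower() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": user})
    # Stream tokens as they arrive so the reply prints incrementally.
    reply = ""
    for chunk in openai.ChatCompletion.create(
        model="local-model", messages=history, stream=True
    ):
        delta = chunk.choices[0].delta.get("content", "")
        reply += delta
        print(delta, end="", flush=True)
    print()
    history.append({"role": "assistant", "content": reply})
```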
-
Hey, I got this to work on my end. Thanks for sharing. I noticed that completion requests are handled sequentially. Is there a way to set it to do multiprocessing?
-
I got a question about this server.
-
Hi, thanks for all your great work providing a wrapper with a web server. I got the wrapper working on my CPU, but I have a ROCm system. I have llama.cpp fully working on my GPU, so I have tried to compile llama_cpp_python with GPU support as well. However, the build gets stuck: my CPU has 6 cores and I think Ninja is using 100% of these threads, which results in the following:
Hope you can advise. I need to know how to tell cmake to use fewer threads, I think. Alas, the options I have tried don't work. I am using a 6-core AMD Ryzen 5 setup.
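(One lever that may help, assuming a CMake-driven build: CMake's documented `CMAKE_BUILD_PARALLEL_LEVEL` environment variable caps the number of parallel build jobs, e.g. prefixing the install command with `CMAKE_BUILD_PARALLEL_LEVEL=4`.)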
-
Hi @abetlen, thanks for llama-cpp-python. Is there a way to add a "stop" parameter to the chat/completions JSON schema, like the one on completions?
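For reference, the OpenAI chat schema itself carries a `stop` field, so a sketch assuming the server forwards it to llama.cpp (pre-1.0 openai package, assumed local server) would be:

```python
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed host/port
openai.api_key = "sk-placeholder"  # not validated by default

response = openai.ChatCompletion.create(
    model="local-model",  # placeholder name
    messages=[{"role": "user", "content": "Count from one to ten."}],
    stop=["six"],  # generation halts before this string would be emitted
)
print(response.choices[0].message.content)
```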
-
Hey everyone,

Just wanted to share that I integrated an OpenAI-compatible web server into the llama-cpp-python package, so you should be able to serve and use any llama.cpp-compatible models with (almost) any OpenAI client. Check out the README, but the basic setup process is to install with `pip install llama-cpp-python[server]` and start the server with `python3 -m llama_cpp.server`, pointing it at your model. Then just navigate to http://localhost:8000/docs to start playing around with it using the Swagger UI.

In terms of compatibility, I've tested it with the official OpenAI Python library by just swapping out `openai.api_base` for the server URL, and it seems to work. I've also had success using it with @mckaywrigley's chatbot-ui, which is a self-hosted ChatGPT UI clone you can run with Docker. Just launch with `-e OPENAI_API_HOST=<api-url>` to get started.

Caveats

- `logprobs` and anything that's OpenAI-specific but llama.cpp doesn't support, like the `best_of` parameter, is just ignored silently.
- Anything that relies on `tiktoken` or some other OpenAI model-specific tokenizer may not work or be buggy, just a heads up.
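Since the server exposes plain HTTP routes mirroring OpenAI's REST schema, non-OpenAI clients work too; a minimal sketch with `requests` against an assumed localhost:8000 server (the `model` field is omitted on the assumption that the server falls back to the model it was started with):

```python
import requests

# POST to the OpenAI-style chat completions route on the local server.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed host/port
    json={"messages": [{"role": "user", "content": "Say hello in one sentence."}]},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```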