In standalone (non-container) mode, the extension connects directly with an Ollama instance.
1. Install Ollama on your machine from the Ollama website.
2. Pull the default models:
   - Chat Model
     ollama pull gemma:2b
   - Code Model
     ollama pull codegemma:2b
Using different models for chat/code completion [Optional]
- Configure the model used for chat in the extension settings (Settings > Local AI Pilot > ollamaModel).
- Configure the model used for code completion in the extension settings (Settings > Local AI Pilot > ollamaCodeModel).
In Container Mode, the LLM API container acts as a bridge between Ollama and the extension, enabling fine-grained customization and advanced features such as Document Q&A, Chat History (caching), and remote models.
- Install Docker and Docker Compose.
- [Optional] GPU (NVIDIA) - Download and install NVIDIA® GPU drivers. Check out GPU support for more information.

Choose the compose file that matches your setup: docker-compose-cpu.yml | docker-compose-gpu.yml
docker compose -f docker-compose-cpu|gpu.yml up llmapi [ollama] [cache]
Container Services
- llmapi : LLM API container service that connects the extension with Ollama. All configuration is done through ENV variables.
- ollama [Optional] : Turn on this service to run Ollama as a container.
- cache [Optional] : Turn on this service for caching and searching chat history.
Tip
Start with the llmapi service. Add other services based on your needs.
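For orientation, here is a rough sketch of how these services could be laid out in the compose file. The actual definitions live in docker-compose-cpu.yml / docker-compose-gpu.yml, so treat the image names below (in particular the llmapi placeholder) as illustrative assumptions.

```yaml
services:
  llmapi:                  # bridge between the extension and the model backend
    image: <llmapi-image>  # placeholder; use the image/build from the shipped compose file
    environment:
      OLLAMA_HOST: ollama  # or host.docker.internal when Ollama runs on the host
  ollama:                  # optional: run Ollama itself as a container
    image: ollama/ollama   # assumed image
  cache:                   # optional: Redis-backed chat history
    image: redis           # assumed image
```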
Configuring Docker Compose to connect with Ollama running on localhost (via ollama app)
docker compose -f docker-compose-cpu|gpu.yml up llmapi
# update the OLLAMA_HOST env variable to point to the host (host.docker.internal)
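Assuming OLLAMA_HOST sits under the llmapi service's environment block (the exact layout may differ in the shipped compose files), the change looks roughly like this:

```yaml
services:
  llmapi:
    environment:
      # reach the Ollama app running on the host machine from inside the container
      OLLAMA_HOST: host.docker.internal
```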
Chat history can be saved in Redis by turning on the cache service. By default, chats are cached for 1 hour, which is configurable in docker compose. Caching also enables searching previous chats from the extension by keyword or chat ID.
docker compose -f docker-compose-cpu|gpu.yml up cache
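As a sketch, enabling the cache service and tuning the expiry might look like the following; the Redis image and the expiry variable name (CACHE_EXPIRY) are assumptions, so check the shipped compose file for the actual keys.

```yaml
services:
  llmapi:
    environment:
      CACHE_EXPIRY: 3600   # hypothetical variable name; chat history TTL in seconds (default 1 hour)
  cache:
    image: redis           # assumed image; the shipped compose file defines the actual service
```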
Start Q&A chat using Retrieval-Augmented Generation (RAG) and embeddings. Pull a local model to generate and query embeddings.
- Embed Model
  ollama pull nomic-embed-text

Use the Docker Compose volume (ragdir) to bind the folder containing documents for Q&A. The embeddings are stored in the volume (ragstorage), as sketched below.
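A minimal sketch of the volume wiring, assuming ragdir is a bind-mounted named volume and that the container paths shown are placeholders (the shipped compose file defines the real ones):

```yaml
services:
  llmapi:
    volumes:
      - ragdir:/data/ragdir          # documents to index for Q&A (assumed container path)
      - ragstorage:/data/ragstorage  # generated embeddings (assumed container path)
volumes:
  ragdir:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /path/to/your/documents  # point this at the folder you want to make searchable
  ragstorage:
```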
- Pull your preferred models from the Ollama model library:
  ollama pull <model-name>
  ollama pull <code-model-name>
  ollama pull <embed-model-name>
- Update the model names in the docker compose environment variables.
  Note: Local models are prefixed with "local/".
  MODEL_NAME: local/<model-name>
  CODE_MODEL_NAME: local/<code-model-name>
  EMBED_MODEL_NAME: local/<embed-model-name>
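For example, using models mentioned elsewhere in this guide (placing the variables under the llmapi service's environment block is an assumption; adjust to match the shipped compose file):

```yaml
services:
  llmapi:
    environment:
      MODEL_NAME: local/llama3
      CODE_MODEL_NAME: local/codellama:7b-code
      EMBED_MODEL_NAME: local/nomic-embed-text
```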
Remote models require API keys, which can be configured in the Docker Compose file.
Supports models from the Gemini, Cohere, OpenAI, Anthropic, and Mistral AI LLM providers.
Update the model name and API key in the docker compose environment variables (a compose-level example follows the provider list below).
Shut down the ollama service if it is running, as it is not used for remote inference.
docker compose down ollama
Model names use the {Provider}/{ModelName} format.
- Gemini
  Create API keys: https://aistudio.google.com/app/apikey
  MODEL_NAME: gemini/gemini-pro
  EMBED_MODEL_NAME: gemini/embedding-001
  API_KEY: <API_KEY>
  EMBED_API_KEY: <API_KEY>
- Cohere
  Create API keys: https://dashboard.cohere.com/api-keys
  MODEL_NAME: cohere/command
  EMBED_MODEL_NAME: cohere/embed-english-v3.0
  API_KEY: <API_KEY>
  EMBED_API_KEY: <API_KEY>
- OpenAI
  Create API keys: https://platform.openai.com/docs/quickstart/account-setup
  MODEL_NAME: openai/gpt-4o
  EMBED_MODEL_NAME: openai/text-embedding-3-large
  API_KEY: <API_KEY>
  EMBED_API_KEY: <API_KEY>
- Anthropic
  Create API keys: https://www.anthropic.com/ and https://www.voyageai.com/
  MODEL_NAME: anthropic/claude-3-opus-20240229
  EMBED_MODEL_NAME: voyageai/voyage-2
  API_KEY: <API_KEY>
  EMBED_API_KEY: <VOYAGE_API_KEY>
- Mistral AI (codestral)
  Create API keys: https://console.mistral.ai/codestral
  MODEL_NAME: mistralai
  CODE_MODEL_NAME: mistralai
  API_KEY: <API_KEY>
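Putting it together, a Gemini setup might look roughly like this in the compose file; placing the variables under the llmapi service's environment block is an assumption:

```yaml
services:
  llmapi:
    environment:
      MODEL_NAME: gemini/gemini-pro
      EMBED_MODEL_NAME: gemini/embedding-001
      API_KEY: <API_KEY>        # your Gemini API key
      EMBED_API_KEY: <API_KEY>
```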
Models trained on a larger number of parameters (7B, 70B) are generally more reliable and precise, though small models like gemma:2b and phi3 can deliver surprisingly good results. Ultimately, choosing the ideal local model depends on your system's resource capacity and the model's performance.
Warning
Heavier models will require more processing power and memory.
You can choose any instruct model for chat. For better results, choose models that are trained for programming tasks.
gemma:2b | phi3 | llama3 | qwen2:1.5b | gemma:7b | codellama:7b
For code completion, choose code models that support FIM (fill-in-the-middle).
codegemma:2b | codegemma:7b-code | codellama:code | codellama:7b-code | deepseek-coder:6.7b-base | granite-code:3b-base
Important
Instruct-based models are not supported for code completion.
Choose any embed model
docker compose -f docker-compose-cpu|gpu.yml up ollama
# update the OLLAMA_HOST env variable to "ollama"
Ollama commands are now available via Docker:
docker exec -it ollama-container ollama ls
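Assuming the compose file names the Ollama container ollama-container (as the exec command above suggests), the relevant wiring is roughly along these lines; the image and other settings are assumptions:

```yaml
services:
  ollama:
    image: ollama/ollama             # assumed image; check the shipped compose file
    container_name: ollama-container # name used in the docker exec command above
  llmapi:
    environment:
      OLLAMA_HOST: ollama            # reach the containerized Ollama by service name
```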