Compendium Keeper is a tool that indexes Compendium data (generated by Compendium Scribe) into a vector database (like Pinecone) to power Retrieval-Augmented Generation (RAG) workflows.
- Easily index concepts from a Compendium into a vector store.
- Embed concept content, questions, and keywords using OpenAI embeddings.
- Store embeddings and metadata in a vector database for quick retrieval.
- Extensible architecture designed to support multiple vector databases (currently only Pinecone is implemented).
- Handles both .compendium.pickle and .compendium.xml file formats.
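Because the two supported formats differ only in their file name suffix, the format can be decided before any file is opened. The following is a minimal sketch of that check; the helper name `detect_compendium_format` and its return values are assumptions made for illustration and are not part of Compendium Keeper's documented interface.

```python
from pathlib import Path

# Suffixes are the two named in this README. The helper name and the
# "pickle"/"xml" return values are illustrative assumptions only.
SUPPORTED_SUFFIXES = {
    ".compendium.pickle": "pickle",
    ".compendium.xml": "xml",
}


def detect_compendium_format(path: str) -> str:
    """Return "pickle" or "xml" based on the file name, or raise ValueError."""
    name = Path(path).name
    for suffix, fmt in SUPPORTED_SUFFIXES.items():
        if name.endswith(suffix):
            return fmt
    raise ValueError(
        f"Unsupported file: {name!r}. Expected a name ending in "
        ".compendium.pickle or .compendium.xml."
    )


print(detect_compendium_format("cell_biology_2024-12-05.compendium.pickle"))
print(detect_compendium_format("cell_biology_2024-12-05.compendium.xml"))
```

Matching on the full double suffix (rather than only `.pickle` or `.xml`) rejects unrelated pickle or XML files early.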
- Python 3.12+
- Compendium Scribe must be installed.
- OpenAI API key.
- Pinecone API key and environment (if using Pinecone).
- Clone the Repository
git clone https://github.com/yourusername/compendiumkeeper.git
cd compendiumkeeper
- Install Dependencies
Ensure you have PDM installed. Then run:
pdm install
Create a .env file in the root directory of the project to store your API keys and configuration. You can use the provided .env.example as a template.
# .env.example
# OpenAI API Key for generating embeddings
OPENAI_API_KEY=sk-your-openai-api-key
# Pinecone API Key and Environment
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_ENVIRONMENT=us-east-1-aws
Rename .env.example to .env and replace the placeholder values with your actual API keys.
- Generate a Compendium using Compendium Scribe
compendium-scribe-create-compendium --domain "Cell Biology"
This produces files like cell_biology_2024-12-05.compendium.pickle and cell_biology_2024-12-05.compendium.xml.
- Index the Compendium
Use the --compendium-file option to specify the Compendium file (pickle or XML).
You must also specify the vector database index name using the --index-name option.
Ensure your .env file is properly configured with the necessary API keys.
pdm run compendium-keeper index --compendium-file cell_biology_2024-12-05.compendium.pickle --index-name my_knowledge_index
pdm run compendium-keeper index --compendium-file cell_biology_2024-12-05.compendium.xml --index-name my_knowledge_index
- Verify Indexing
After successful execution, you should see a confirmation message indicating the number of concepts indexed.
Indexed 25 concepts from domain 'Cell Biology' into index 'my_knowledge_index'.
Indexing complete!
- Combine Multiple Compendia
To create a single knowledge base that spans multiple Compendia, repeat the indexing process for each Compendium, using the same --index-name.
For example:
pdm run compendium-keeper index --compendium-file django_2024-12-10.compendium.pickle --index-name all_python_knowledge
pdm run compendium-keeper index --compendium-file flask_2024-12-10.compendium.xml --index-name all_python_knowledge
This will merge the knowledge from multiple Compendia into the same vector database index.
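When many Compendia share one index, the repeated invocations can be generated rather than typed by hand. The sketch below only assembles the command lines already shown in this README; the function names `build_index_commands` and `index_all` are assumptions introduced for illustration.

```python
import shlex
import subprocess


def build_index_commands(compendium_files: list[str], index_name: str) -> list[list[str]]:
    """Build one `compendium-keeper index` invocation per Compendium file."""
    return [
        ["pdm", "run", "compendium-keeper", "index",
         "--compendium-file", path,
         "--index-name", index_name]
        for path in compendium_files
    ]


def index_all(compendium_files: list[str], index_name: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    for command in build_index_commands(compendium_files, index_name):
        subprocess.run(command, check=True)


# Preview the commands without executing them.
files = ["django_2024-12-10.compendium.pickle", "flask_2024-12-10.compendium.xml"]
for command in build_index_commands(files, "all_python_knowledge"):
    print(shlex.join(command))
```

Calling `index_all(files, "all_python_knowledge")` would then execute the same commands; passing argument lists to `subprocess.run` avoids shell quoting problems with unusual file names.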
- Multiple Vector Databases: The architecture allows for adding support for other vector databases (e.g., Weaviate, ChromaDB) by implementing new classes in the vector_db/ directory.
- Custom Embedding Strategies: Modify or extend utils.py to customize how embeddings are generated or processed.
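As an illustration of the extension point, the sketch below shows what a new backend might look like. Only the class name `VectorDatabase` comes from this README; the `upsert` method, its signature, the `Record` dataclass, and the in-memory backend are assumptions, and the real base class in vector_db/ may define a different interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Record:
    """One embedded item: an ID, its vector, and free-form metadata."""
    id: str
    vector: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


class VectorDatabase(ABC):
    """Assumed interface; the project's actual base class may differ."""

    @abstractmethod
    def upsert(self, index_name: str, records: list[Record]) -> int:
        """Insert or overwrite records and return how many were submitted."""


class InMemoryVectorDatabase(VectorDatabase):
    """Toy backend that keeps records in a dict, handy for local tests."""

    def __init__(self) -> None:
        self.indexes: dict[str, dict[str, Record]] = {}

    def upsert(self, index_name: str, records: list[Record]) -> int:
        index = self.indexes.setdefault(index_name, {})
        for record in records:
            index[record.id] = record  # a repeated ID overwrites the old record
        return len(records)


db = InMemoryVectorDatabase()
record = Record("concept-1", [0.1, 0.2], {"domain": "Cell Biology"})
print(db.upsert("my_knowledge_index", [record]))
```

Keeping the abstract interface small makes it easier to add a backend such as Weaviate or ChromaDB without touching the indexing logic.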
- Set Up Environment Variables: Create a .env file as described above.
- Generate a Compendium: Use Compendium Scribe to generate a Compendium in pickle or XML format.
- Index with Compendium Keeper: Run the indexing command to upload embeddings to your chosen vector database.
- Missing API Keys: Ensure that your .env file contains all required API keys. The CLI will notify you if any are missing.
- Unsupported Vector DB: Currently, only Pinecone is supported. To add support for another vector database, implement a new class in vector_db/ adhering to the VectorDatabase abstract base class.
- File Format Issues: Ensure that the --compendium-file you provide ends in either .compendium.pickle or .compendium.xml. Files with other extensions are not supported.
- API Rate Limits: Be mindful of OpenAI's API rate limits when indexing large Compendia. Consider implementing batching or rate limiting if necessary.
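The batching and rate limiting suggested for large Compendia can be sketched as follows. This is not Compendium Keeper's implementation: `embed_in_batches`, `chunked`, and all parameter names are hypothetical, and the embedding call is passed in as a function so the retry logic stays independent of any particular client library.

```python
import time
from collections.abc import Callable, Iterator, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 100,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[list[float]]:
    """Embed texts batch by batch, retrying failures with exponential backoff."""
    vectors: list[list[float]] = []
    for batch in chunked(texts, batch_size):
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                # Real code should catch only the client's rate-limit error.
                if attempt == max_retries:
                    raise
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return vectors


def fake_embed(batch: Sequence[str]) -> list[list[float]]:
    """Stand-in for a real embedding call: one number per text."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

In practice `embed_batch` would wrap the embedding client's request, and the broad `except Exception` would be narrowed to that client's rate-limit exception so that genuine errors are not retried.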
Contributions are welcome! Feel free to open an issue or submit a pull request.
Compendium Keeper is released under the MIT License.