A Python package for indexing Compendium data (generated by Compendium Scribe) into a vector database (like Pinecone) to power Retrieval-Augmented Generation (RAG) workflows

btfranklin/compendiumkeeper

Compendium Keeper

Compendium Keeper is a tool that indexes Compendium data (generated by Compendium Scribe) into a vector database (like Pinecone) to power Retrieval-Augmented Generation (RAG) workflows.

Features

  • Easily index concepts from a Compendium into a vector store.
  • Embed concept content, questions, and keywords using OpenAI embeddings.
  • Store embeddings and metadata in a vector database for quick retrieval.
  • Extensible architecture for adding other vector databases (currently only Pinecone is supported).
  • Handles both .compendium.pickle and .compendium.xml file formats.

Requirements

  • Python 3.12+
  • Compendium Scribe must be installed.
  • OpenAI API key.
  • Pinecone API key and environment (if using Pinecone).

Installation

  1. Clone the Repository

git clone https://github.com/btfranklin/compendiumkeeper.git
cd compendiumkeeper

  2. Install Dependencies

Ensure you have PDM installed. Then run:

pdm install

Configuration

Create a .env file in the root directory of the project to store your API keys and configuration. You can use the provided .env.example as a template.

Example .env File

# .env.example

# OpenAI API Key for generating embeddings
OPENAI_API_KEY=sk-your-openai-api-key

# Pinecone API Key and Environment
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_ENVIRONMENT=us-east-1-aws

Copy .env.example to .env and replace the placeholder values with your actual API keys.
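
Before indexing, it can help to confirm that the configuration is complete. The following is a minimal sketch, assuming the variables from the example file have already been loaded into the process environment (for instance by a .env loader). The helper name missing_keys is hypothetical and is not part of Compendium Keeper.

```python
import os

# Variable names taken from the example .env file above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing configuration: " + ", ".join(missing))
    else:
        print("All required keys are set.")
```

Passing the mapping in as an argument, rather than reading os.environ inside the function, keeps the check easy to test with a plain dictionary.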

Usage

  1. Generate a Compendium using Compendium Scribe

compendium-scribe-create-compendium --domain "Cell Biology"

This produces files like cell_biology_2024-12-05.compendium.pickle and cell_biology_2024-12-05.compendium.xml.

  2. Index the Compendium

Use the --compendium-file option to specify the Compendium file (pickle or XML).
You must also specify the vector database index name using the --index-name option.

Ensure your .env file is properly configured with the necessary API keys.

Index from a Pickle File

pdm run compendium-keeper index --compendium-file cell_biology_2024-12-05.compendium.pickle --index-name my_knowledge_index

Index from an XML File

pdm run compendium-keeper index --compendium-file cell_biology_2024-12-05.compendium.xml --index-name my_knowledge_index

  3. Verify Indexing

After successful execution, you should see a confirmation message indicating the number of concepts indexed.

Indexed 25 concepts from domain 'Cell Biology' into index 'my_knowledge_index'.
Indexing complete!

  4. Combine Multiple Compendia

To create a single knowledge base that spans multiple Compendia, repeat the indexing process for each Compendium, using the same --index-name.

For example:

pdm run compendium-keeper index --compendium-file django_2024-12-10.compendium.pickle --index-name all_python_knowledge
pdm run compendium-keeper index --compendium-file flask_2024-12-10.compendium.xml --index-name all_python_knowledge

This will merge the knowledge from multiple Compendia into the same vector database index.
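
When there are many Compendia, the repeated commands can be scripted. The following is a rough sketch, assuming the CLI is invoked exactly as in the commands above. The function names build_index_commands and index_all are hypothetical helpers, not part of Compendium Keeper.

```python
import subprocess

# File name endings the README lists as supported.
SUPPORTED_SUFFIXES = (".compendium.pickle", ".compendium.xml")


def build_index_commands(compendium_files, index_name):
    """Build one `compendium-keeper index` command per Compendium file."""
    commands = []
    for path in compendium_files:
        if not path.endswith(SUPPORTED_SUFFIXES):
            raise ValueError(f"Unsupported Compendium file: {path}")
        commands.append([
            "pdm", "run", "compendium-keeper", "index",
            "--compendium-file", path,
            "--index-name", index_name,
        ])
    return commands


def index_all(compendium_files, index_name):
    """Run the commands in order, stopping at the first failure."""
    for command in build_index_commands(compendium_files, index_name):
        subprocess.run(command, check=True)
```

Building the argument lists separately from running them makes it possible to inspect or log the commands before anything is sent to the vector database.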

Extensibility

  • Multiple Vector Databases: The architecture allows for adding support for other vector databases (e.g., Weaviate, ChromaDB) by implementing new classes in the vector_db/ directory.
  • Custom Embedding Strategies: Modify or extend utils.py to customize how embeddings are generated or processed.
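
As an illustration of what a new backend might look like, here is a minimal sketch. The VectorDatabase class below is a stand-in written for this example: the real abstract base class lives in vector_db/, and its actual method names and signatures may differ. The upsert method and the InMemoryVectorDatabase class are assumptions, not the project's API.

```python
from abc import ABC, abstractmethod


class VectorDatabase(ABC):
    """Stand-in for the project's base class; the method name is an assumption."""

    @abstractmethod
    def upsert(self, index_name, records):
        """Store (id, vector, metadata) records and return how many were written."""


class InMemoryVectorDatabase(VectorDatabase):
    """Toy backend that keeps records in a dict, useful for local tests."""

    def __init__(self):
        self.indexes = {}

    def upsert(self, index_name, records):
        # Records with an existing id overwrite the stored entry.
        index = self.indexes.setdefault(index_name, {})
        count = 0
        for record_id, vector, metadata in records:
            index[record_id] = (list(vector), dict(metadata))
            count += 1
        return count
```

A backend for a real service would follow the same shape, with the dictionary writes replaced by calls to that service's client library.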

Developer Workflow

  1. Set Up Environment Variables

    Create a .env file as described above.

  2. Generate a Compendium

    Use Compendium Scribe to generate a Compendium in pickle or XML format.

  3. Index with Compendium Keeper

    Run the indexing command to upload embeddings to your chosen vector database.

Troubleshooting

  • Missing API Keys

    Ensure that your .env file contains all required API keys. The CLI will notify you if any are missing.

  • Unsupported Vector DB

    Currently, only Pinecone is supported. To add support for another vector database, implement a new class in vector_db/ adhering to the VectorDatabase abstract base class.

  • File Format Issues

    Ensure that the --compendium-file you provide ends in either .compendium.pickle or .compendium.xml. Files with other extensions are not supported.

  • API Rate Limits

    Be mindful of OpenAI's API rate limits when indexing large Compendia. Consider implementing batching or rate limiting if necessary.
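
One way to approach batching with retries is sketched below. It assumes a caller-supplied embed_batch function that takes a list of texts and returns one vector per text; that callable, the helper names, and the default values are hypothetical and do not describe how Compendium Keeper currently generates embeddings.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=100,
                     max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Embed texts batch by batch, retrying a failed batch with exponential backoff."""
    vectors = []
    for batch in batched(texts, batch_size):
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                # In real code, catch only the provider's rate-limit error.
                if attempt == max_retries - 1:
                    raise
                sleep(base_delay * 2 ** attempt)
    return vectors
```

The sleep parameter is injectable so the retry logic can be exercised in tests without real delays.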

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request.

License

Compendium Keeper is released under the MIT License.
