This is an app that lets you ask questions about any data source by leveraging embeddings, vector databases, large language models, and, last but not least, LangChain
- A LangChain chain is created, consisting of an LLM (`gpt-3.5-turbo` by default) with the vector store as retriever
- When you ask the app a question, the chain embeds the input prompt, runs a similarity search in the vector store, and uses the best results as context for the LLM to generate an appropriate response
- Finally, the chat history is cached locally to enable a ChatGPT-like Q&A conversation
- Retrieve links
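
The embed, search, and answer flow described above can be illustrated with a small, dependency-free sketch. This is not the app's code and not the LangChain API: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for the vector store, and every name here (`embed`, `cosine`, `retrieve`, `build_prompt`) is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, docs, k=2):
    """Return the k documents most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """Put the retrieved documents in front of the question as context for the LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical documents standing in for stored posts.
docs = [
    "Homework 3 is due on Friday at midnight.",
    "Office hours are held in the lab on Tuesday.",
    "The final exam covers all lectures.",
]
question = "When is homework 3 due?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real app, the prompt built this way would be sent to the LLM together with the cached chat history.
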
If you'd like to contribute, feel free to grab any task
- Would like to change the `iter_all_posts` function so it only downloads questions that haven't been stored yet
- Perhaps we could use a Snowflake or MongoDB database to store docs that are also in the vector database. This way, we could see what's in the vector database and what
- Add a linter
- Add a different template for posts that are notes
- Add weights to documents
- Some answers are formatted in HTML or Markdown; convert them to plain text when downloading
- Need to make embedded links accessible
- Add more documents:
- B&O textbook
- Slide decks
- Specs
- Random stuff from the calendar
- Would be cool if you could select which homework you're working on, kind of like selecting a folder for a Piazza post...
- Ensure that there are no contradictory answers:
- Make a training dataset of questions
- Ensure the answers are correct
- Negate all of the questions
- See what responses LLM produces
- Answers need to have links to documentation
- I can incorporate this relatively easily
- Obviously this was forked from someone else's repo, so would like to remove all the unnecessary stuff that's in here
- Update the .gitignore file
- I'm not sure what I meant exactly, but we shouldn't be uploading unnecessary/secret stuff once it's public
- Add different data sources for CS40, CS11, etc.
- Change the "Data source text_embeddings is ready to go with model gpt-3.5-turbo!" message that pops up to also say that you're authenticated or something
- Add a dark mode...
- Need basic information about the course too!
- Professor, language taught in, etc.
- There could just be 1 document for that. Could grab from SIS?
- Could grab it from the course website using the requests library?
- Find the textbooks on the internet and load them into the database too. It'd be really helpful if when a student asked a specific question they got feedback on the question.
- This warning pops up sometimes, would like to fix: "/Users/john.eastman/Desktop/Personal/TABot/piazza_data.py:28: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup. soup = BeautifulSoup(html_text, 'lxml')"
- A "neither appeared" popped up, need to figure out what post that is. I think it's a poll
- Need to figure out the optimal time to sleep for the Piazza API. Also, is there a way to go really fast, and then when an error pops up, stop, wait, and continue?
- Implement boosted retrieval / ping the LLM again with boosted retrieval if the response is bad.
- We could have students label their answer in 1 of a few different ways, then use that as training data for the audit model.
- https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever
- Figure out why the prompt template is getting printed so often.
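
One possible approach to the Piazza rate-limit item above is to run with no fixed sleep and only back off when a request fails, doubling the wait each time. The sketch below assumes the download can be wrapped in a callable; `RateLimitError`, `fetch_with_backoff`, and the `fetch` argument are hypothetical names, and the real exception raised by the Piazza client would need to be substituted.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error; stands in for whatever the Piazza client raises when throttled."""


def fetch_with_backoff(fetch, post_ids, base_delay=1.0, max_delay=60.0,
                       max_retries=5, sleep=time.sleep):
    """Fetch each post as fast as possible, waiting only after a failure.

    `fetch` is any callable taking a post id. After each failure the wait
    doubles (capped at max_delay); it resets for the next post.
    """
    results = []
    for post_id in post_ids:
        delay = base_delay
        for attempt in range(max_retries + 1):
            try:
                results.append(fetch(post_id))
                break  # success: move on to the next post without sleeping
            except RateLimitError:
                if attempt == max_retries:
                    raise  # give up after the final retry
                sleep(delay)
                delay = min(delay * 2, max_delay)
    return results
```

The `sleep` parameter is injectable so the retry logic can be exercised without actually waiting.
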
- Run `poetry install`
- Optional (recommended): add the Python interpreter; its location will be displayed as the first line after `poetry install`