
How to get the ONLY original documents' chunks? #54

Open
daniyal214 opened this issue Sep 2, 2024 · 8 comments

Comments

@daniyal214

daniyal214 commented Sep 2, 2024

During retrieval I did context, __ = RA.retrieve(question) to inspect the context, since I was not getting the desired response.
I noticed that the context being passed to the QA model to answer the question, self.qa_model.answer_question(context, question), is not just the actual chunks of text. There is also summarized text in the node list when we do self.context_chunks = [node.text for node in node_list].
I am wondering how I can get the actual chunk nodes only, since I want to pass only the original context text to the QA model.
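For reference, this is roughly how I am inspecting it (a minimal sketch; RA is the RetrievalAugmentation instance I already built from my docs):

```python
# Minimal sketch of the inspection described above; assumes `RA` is a raptor
# RetrievalAugmentation instance with the documents already added.
context, _ = RA.retrieve(question)
print(context)  # contains summary-node text mixed in with the original chunks
```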

My actual docs have lots of tutorial URLs (which I need in the response) in various places. But every time I get a response, the URLs are messed up, broken, or missing. So I dug in to check and found out that the QA model is not getting only the actual chunks.

Is there any way to retrieve a context that consists only of the original portions of the documents?

Thanks.

@daniyal214 daniyal214 changed the title How to get the original documents' chunks? How to get the ONLY original documents' chunks? Sep 2, 2024
@daniyal214
Author

@parthsarthi03

@parthsarthi03
Owner

You can set start_layer=0 and num_layers=1 in either RA.retrieve or RA.answer_question. This will effectively restrict retrieval to the leaf layer, which is the original text. You can also set tb_num_layers while building the tree so that no summaries are built.
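A minimal sketch of both options, assuming RA is an existing RetrievalAugmentation instance and that the parameter names above match your installed version:

```python
# Sketch only; parameter names follow the reply above, so verify them
# against your installed version of raptor.

# Option 1: retrieve only from the leaf layer (the original text)
context, _ = RA.retrieve(question, start_layer=0, num_layers=1)

# Option 2: build the tree without summary layers in the first place
# (RetrievalAugmentationConfig and the tb_ prefix are assumptions here)
# from raptor import RetrievalAugmentation, RetrievalAugmentationConfig
# config = RetrievalAugmentationConfig(tb_num_layers=0)
# RA = RetrievalAugmentation(config=config)
```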

@daniyal214
Author

Thanks for the response @parthsarthi03. Alright, I'll try it and get back to you.

One question: if I set tb_num_layers to, say, 0, so that no summarization is done, will this be helpful in any way? It seems it would then just act as naive RAG, where we feed all the doc chunks to the retriever and take the top_k, since the summarization and clustering are the specialty of RAPTOR. Am I right?

@parthsarthi03
Owner

Yes, it will act as naive RAG. Looking back at your original question, do you want to use the tree traversal method of RAPTOR and just filter for the leaf layer? The setting I mentioned will simply restrict retrieval to that layer, effectively doing naive RAG.

@daniyal214
Author

daniyal214 commented Sep 3, 2024

@parthsarthi03
Yes, I want the tree traversal method of RAPTOR to give me the chunks of the original doc. That would be a better choice compared to naive RAG.

@parthsarthi03
Owner

parthsarthi03 commented Sep 3, 2024

Ah, okay, that is a bit harder but doable. You'll have to add the following filter just before the line below it, so that only the leaf nodes are kept.

selected_nodes = [node for node in selected_nodes if node.index in self.tree.leaf_nodes]

context = get_text(selected_nodes)

This should filter the selected nodes down to only the leaf nodes. Let me know if you run into any issues.
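If you'd rather not edit the library in place, the same idea works as a small post-processing helper. A rough sketch, where leaf_only_context is just an illustrative name and, as above, tree.leaf_nodes is assumed to be keyed by node index with each node exposing .index and .text:

```python
def leaf_only_context(selected_nodes, tree):
    """Drop summary nodes and keep only the original document chunks."""
    leaf_nodes = [node for node in selected_nodes if node.index in tree.leaf_nodes]
    # roughly what get_text(selected_nodes) does: concatenate the chunk texts
    return "\n\n".join(node.text for node in leaf_nodes)
```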

@daniyal214
Author

Thanks @parthsarthi03, I can now get the original doc chunks with the tree traversal method.

I have a few questions:

  1. When we set start_layer=0 and num_layers=1, we get the current_nodes. These current_nodes are basically all the nodes of layer #1, and layer #1 is the first layer, containing the chunks of the original doc BEFORE any summarization is done. Am I right?
  2. I also want to modify the retrieve_information function to return the relevant chunks up to max_tokens, so that I get the maximum amount of context, made up of only the original document. Is this possible with the help of current_nodes?

@parthsarthi03
Owner

  1. If you want to use the tree traversal method of RAPTOR and only use the original docs, do not set start_layer=0 and num_layers=1; simply add the filter I gave before. I believe, though I'll have to check the indexing, that layer 0 should be the first layer, before any summarization.
  2. Yes, you can have a filter for max tokens. It is supported in the retrieve_information_collapse_tree function, and you should be able to copy over the logic with some minor changes. You'll have to decide how to rank the nodes, though: you can do something similar to retrieve_information_collapse_tree, where you rank by cosine similarity (a rough sketch of this follows below), or have a fancier BFS/DFS search that stops based on max_tokens. If you implement the second, feel free to send in a PR.
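A rough sketch of the simpler option: rank the leaf chunks by cosine similarity to the question and stop at a token budget. This is not RAPTOR's API; retrieve_leaf_context is a hypothetical helper, and leaf_nodes, embed, and the tiktoken encoding are placeholders you would wire up to your own tree and embedding model.

```python
import numpy as np
import tiktoken


def retrieve_leaf_context(question, leaf_nodes, embed, max_tokens=2000):
    """Rank leaf chunks by cosine similarity to the question and keep adding
    them until the token budget is reached.

    leaf_nodes: iterable of nodes exposing `.text`
    embed: callable mapping a string to an embedding vector
    """
    tokenizer = tiktoken.get_encoding("cl100k_base")
    q_vec = np.asarray(embed(question))

    def cosine(vec):
        vec = np.asarray(vec)
        return float(np.dot(q_vec, vec) / (np.linalg.norm(q_vec) * np.linalg.norm(vec)))

    # Score every leaf chunk against the question and sort best-first.
    ranked = sorted(leaf_nodes, key=lambda node: cosine(embed(node.text)), reverse=True)

    selected, used = [], 0
    for node in ranked:
        n_tokens = len(tokenizer.encode(node.text))
        if used + n_tokens > max_tokens:
            break
        selected.append(node.text)
        used += n_tokens

    return "\n\n".join(selected)
```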
