
How to get the ONLY original documents' chunks? #54

Open
daniyal214 opened this issue Sep 2, 2024 · 8 comments

Comments

@daniyal214

daniyal214 commented Sep 2, 2024

During retrieval I did context, __ = RA.retrieve(question) to inspect the context, since I was not getting the desired response.
I noticed that the context being passed to the QA model to answer the question, self.qa_model.answer_question(context, question), is not just the actual chunks of text. There is also summarized text in the node list when we do self.context_chunks = [node.text for node in node_list].
I am wondering how I can get the actual chunk nodes only, since I want to pass only the original context text to the QA model.
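For reference, this is roughly how I am inspecting it (a minimal sketch; RA is the RetrievalAugmentation instance I already built from my docs):

```python
# Minimal sketch of the inspection described above; assumes `RA` is a raptor
# RetrievalAugmentation instance with the documents already added.
context, _ = RA.retrieve(question)
print(context)  # contains summary-node text mixed in with the original chunks
```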

My actual docs have lots of tutorial URLs (which I need in the response) in various places. But every time I get a response, the URLs are messed up, broken, or missing. So I dug in to check and found out that the QA model is not getting only the actual chunks.

Is there any way to retrieve a context that consists only of the original portions of the documents?

Thanks.

@daniyal214 daniyal214 changed the title How to get the original documents' chunks? How to get the ONLY original documents' chunks? Sep 2, 2024
@daniyal214
Author

@parthsarthi03

@parthsarthi03
Owner

You can set start_layer=0 and num_layers=1 in either RA.retrieve or RA.answer_question. This will effectively restrict retrieval to the leaf layer, which is the original text. You can also set tb_num_layers while building the tree so that no summaries are built.
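A minimal sketch of both options, assuming RA is an existing RetrievalAugmentation instance and that the parameter names above match your installed version:

```python
# Sketch only; parameter names follow the reply above, so verify them
# against your installed version of raptor.

# Option 1: retrieve only from the leaf layer (the original text)
context, _ = RA.retrieve(question, start_layer=0, num_layers=1)

# Option 2: build the tree without summary layers in the first place
# (RetrievalAugmentationConfig and the tb_ prefix are assumptions here)
# from raptor import RetrievalAugmentation, RetrievalAugmentationConfig
# config = RetrievalAugmentationConfig(tb_num_layers=0)
# RA = RetrievalAugmentation(config=config)
```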

@daniyal214
Author

Thanks for the response @parthsarthi03. Alright, I'll try it and get back to you.

One question: if I set tb_num_layers to, say, 0, so that no summarization is done, will this be helpful in any way? It seems it would then just act as naive RAG, where we feed all the doc chunks to the retriever and take the top_k, since the summarization and clustering are the specialty of RAPTOR. Am I right?

@parthsarthi03
Owner

Yes, it will act as naive RAG. Looking back at your original question, do you want to use the tree traversal method of RAPTOR and just filter for the leaf layer? The setting I mentioned will simply restrict retrieval to that layer, effectively doing naive RAG.

@daniyal214
Author

daniyal214 commented Sep 3, 2024

@parthsarthi03
Yes, I want the tree traversal method of RAPTOR to give me the chunks of the original doc. That would be a better choice compared to naive RAG.

@parthsarthi03
Owner

parthsarthi03 commented Sep 3, 2024

Ah, okay, that is a bit harder but doable. You'll have to add the following filter just before the line below it, so that only the leaf nodes are kept.

selected_nodes = [node for node in selected_nodes if node.index in self.tree.leaf_nodes]

context = get_text(selected_nodes)

This should filter the selected nodes down to only the leaf nodes. Let me know if you run into any issues.
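If you'd rather not edit the library in place, the same idea works as a small post-processing helper. A rough sketch, where leaf_only_context is just an illustrative name and, as above, tree.leaf_nodes is assumed to be keyed by node index with each node exposing .index and .text:

```python
def leaf_only_context(selected_nodes, tree):
    """Drop summary nodes and keep only the original document chunks."""
    leaf_nodes = [node for node in selected_nodes if node.index in tree.leaf_nodes]
    # roughly what get_text(selected_nodes) does: concatenate the chunk texts
    return "\n\n".join(node.text for node in leaf_nodes)
```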

@daniyal214
Author

Thanks @parthsarthi03, I can now get the original doc chunks with the tree traversal method.

I have a few questions:

  1. When we set start_layer=0 and num_layers=1, we get the current_nodes. These current_nodes are basically all the nodes of layer #1, and layer #1 is the first layer, containing the chunks of the original doc BEFORE any summarization is done. Am I right?
  2. I also want to modify the retrieve_information function to return the relevant chunks up to max_tokens, so that I get the maximum amount of context, made up of only the original document. Is this possible with the help of current_nodes?

@parthsarthi03
Owner

  1. If you want to use the tree traversal method of RAPTOR and only use the original docs, do not set start_layer=0 and num_layers=1; simply add the filter I gave before. I believe, though I'll have to check the indexing, that layer 0 should be the first layer, before any summarization.
  2. Yes, you can have a filter for max tokens. It is supported in the retrieve_information_collapse_tree function, and you should be able to copy over the logic with some minor changes. You'll have to decide how to rank the nodes, though: you can do something similar to retrieve_information_collapse_tree, where you rank by cosine similarity (a rough sketch of this follows below), or have a fancier BFS/DFS search that stops based on max_tokens. If you implement the second, feel free to send in a PR.
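A rough sketch of the simpler option: rank the leaf chunks by cosine similarity to the question and stop at a token budget. This is not RAPTOR's API; retrieve_leaf_context is a hypothetical helper, and leaf_nodes, embed, and the tiktoken encoding are placeholders you would wire up to your own tree and embedding model.

```python
import numpy as np
import tiktoken


def retrieve_leaf_context(question, leaf_nodes, embed, max_tokens=2000):
    """Rank leaf chunks by cosine similarity to the question and keep adding
    them until the token budget is reached.

    leaf_nodes: iterable of nodes exposing `.text`
    embed: callable mapping a string to an embedding vector
    """
    tokenizer = tiktoken.get_encoding("cl100k_base")
    q_vec = np.asarray(embed(question))

    def cosine(vec):
        vec = np.asarray(vec)
        return float(np.dot(q_vec, vec) / (np.linalg.norm(q_vec) * np.linalg.norm(vec)))

    # Score every leaf chunk against the question and sort best-first.
    ranked = sorted(leaf_nodes, key=lambda node: cosine(embed(node.text)), reverse=True)

    selected, used = [], 0
    for node in ranked:
        n_tokens = len(tokenizer.encode(node.text))
        if used + n_tokens > max_tokens:
            break
        selected.append(node.text)
        used += n_tokens

    return "\n\n".join(selected)
```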
