Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to add Metadata to index #15

Open
NMVRodrigues opened this issue Sep 11, 2024 · 4 comments
Open

Unable to add Metadata to index #15

NMVRodrigues opened this issue Sep 11, 2024 · 4 comments
Labels
bug Something isn't working ColPali Upstream

Comments

@NMVRodrigues
Copy link

NMVRodrigues commented Sep 11, 2024

When trying to add metadata to an index, either using a list of metadata dicts or a mapping of uid to metadata dict (shown below), it always produces a key error.

Example:

RAG = RAGMultiModalModel.from_pretrained("vidore/colpali", device='cuda:2', verbose=1)

# contains 29 1-page pdfs
files = glob(os.path.join('/dataset_pdf', '*.pdf'))

# generate simple unique ids
uids = list(range(len(files)))

# get the file names
report_ids = [file.split('/')[-1].split('.pdf')[0] for file in files]

metadata = {uids[i]: {'file_name':report_ids[i]} for i in range(len(uids))}

RAG.index(
    input_path='dataset_pdf',
    index_name='Documents', # index will be saved at index_root/index_name/
    doc_ids=uids,
    store_collection_with_index=True,
    overwrite=True,
    metadata=metadata,

)

This produces the following error:

report_ids = [file.split('/')[-1].split('.pdf')[0] for file in files]
metadata = {uids[i]: {'file_name':report_ids[i]} for i in range(len(uids))}
--> RAG.index(
input_path='dataset_pdf',
index_name='Documents', # index will be saved at index_root/index_name/
doc_ids=uids,
store_collection_with_index=True,
overwrite=True,
metadata=metadata,
)

File ~/miniconda3/envs/rag/lib/python3.9/site-packages/byaldi/RAGModel.py:111, in RAGMultiModalModel.index(self, input_path, index_name, doc_ids, store_collection_with_index, overwrite, metadata)
def index(
 self,
 input_path: Union[str, Path],
   (...)
...
-->  current_metadata = metadata[i] if metadata else None
 if current_doc_id in self.doc_ids:
 raise ValueError(f"Document ID {current_doc_id} already exists in the index")

KeyError: 0

Removing metadata solves this problem, however, it should be ok based on the metadata docstring from RAGMultiModalModel

@bclavie
Copy link
Contributor

bclavie commented Sep 16, 2024

Hey

Removing metadata solves this problem, however, it should be ok based on the metadata docstring from RAGMultiModalModel

It should be yes, this is my mistake. The docstring is about the ideal scenario, which I had to rollback to not hold back the release because of slight metadata issues. Currently it only works with a list of len(documents), but the proper dictionary support will be re-added. Flagging this as a bug as you're expecting a behaviour that should be here.

@bclavie bclavie added bug Something isn't working ColPali Upstream labels Sep 16, 2024
@vivekjoshi556
Copy link

vivekjoshi556 commented Nov 6, 2024

Hi @NMVRodrigues, @bclavie
I also had this issue. Each time I tried to adding the metadata field it kept giving me this error:

-->  current_metadata = metadata[i] if metadata else None
 if current_doc_id in self.doc_ids:
 raise ValueError(f"Document ID {current_doc_id} already exists in the index")

KeyError: 0

As a workaround for now sending each metadata as a list worked for me. So originally my metadata looked something like this:

items = [{"filename": "file1.pdf"}, {"filename": "file2.pdf"}, {"filename": "file3.pdf"}]

I changed to:

items = [[{"filename": "file1.pdf"}], [{"filename": "file2.pdf"}], [{"filename": "file3.pdf"}]]

Notice how I converted each metadata to a dictionary inside a list. It worked for me. Obviously, this is a work around and expectation would be the ideal situation as in the docs.

Let me know if you think there is something wrong with this.

@tma66
Copy link

tma66 commented Nov 7, 2024

Curious on the status for this. I noticed there was a fix out but hasn't been merged yet.

@bclavie
Copy link
Contributor

bclavie commented Nov 7, 2024

Curious on the status for this. I noticed there was a fix out but hasn't been merged yet.

I've been out for various reasons (vacation/illness), but there'll be a byaldi release tomorrow or Monday (in the worst case) incorporating lots of tiny fixes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ColPali Upstream
Projects
None yet
Development

No branches or pull requests

4 participants