Is the current code primarily targeted at front-view images? #16

Open

sunbin1357 opened this issue Dec 25, 2024 · 1 comment

@sunbin1357 commented Dec 25, 2024

Thanks for your great work.

The `senna_nusc_data_converter` part only appends `\n<image>` at the end of the prompt, without the per-view surround prompt mentioned in the paper, `<FRONT VIEW>:\n<image>\n`. Therefore, during training, although it generates surround-view image features, `num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()` is always 1, so only the front-view image features are used. Is the current code primarily targeted at front-view images?
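To make the counting issue concrete, here is a minimal sketch of the logic I mean (LLaVA-style; `IMAGE_TOKEN_INDEX` and the token ids are illustrative, not copied from the repo):

```python
import torch

IMAGE_TOKEN_INDEX = -200  # LLaVA-style placeholder id for <image>

# Illustrative input ids for a prompt that only appends "\n<image>" once:
# a single IMAGE_TOKEN_INDEX appears, even if six camera views were encoded.
cur_input_ids = torch.tensor([101, 2023, 2003, 1037, 3367, IMAGE_TOKEN_INDEX])

num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
print(num_images.item())  # 1 -> only the first (front-view) feature set is spliced in
```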

Additionally, where in the code is the multi-head self-attention part from the figure below?
[figure: architecture diagram attached by the author]

@rb93dett (Collaborator) commented

Hi, the current code supports multi-image input. I have updated the data generation code to include multi-view prompts; this was a bug, thank you for pointing it out.
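For anyone who hits this before pulling the fix, the intended format is one view tag plus one `<image>` placeholder per camera, along the lines of the sketch below (the view names and order here are assumptions; the updated converter has the exact strings):

```python
# Hypothetical sketch of a per-view prompt in the style described in the paper;
# view names/order are assumptions, not copied from the updated converter.
VIEWS = [
    "FRONT VIEW", "FRONT LEFT VIEW", "FRONT RIGHT VIEW",
    "BACK VIEW", "BACK LEFT VIEW", "BACK RIGHT VIEW",
]

def build_multiview_prompt(question: str) -> str:
    # One <image> placeholder per camera, so the image-token count matches
    # the number of encoded views instead of collapsing to 1.
    image_tags = "".join(f"<{view}>:\n<image>\n" for view in VIEWS)
    return image_tags + question

print(build_multiview_prompt("Describe the driving scene."))
```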

Regarding your question: in the paper, we implemented image token compression using a Q-Former. In subsequent experiments, we found that using an MLP yielded similar results and converged faster, so the current code uses an MLP instead of a Q-Former. See here.
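Conceptually, the change swaps a learned-query cross-attention compressor for a plain MLP over grouped tokens. A minimal PyTorch sketch of the MLP idea (the dimensions and grouping factor here are illustrative assumptions, not the exact module in the repo):

```python
import torch
import torch.nn as nn

class MLPTokenCompressor(nn.Module):
    """Sketch: compress image tokens by concatenating the features of
    `group` adjacent tokens and projecting them with a two-layer MLP."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, group: int = 4):
        super().__init__()
        self.group = group
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c = x.shape  # (batch, num_tokens, vision_dim)
        assert t % self.group == 0, "token count must be divisible by group size"
        x = x.reshape(b, t // self.group, c * self.group)
        return self.proj(x)  # (batch, num_tokens // group, llm_dim)

# Example: 576 ViT patch tokens per view compressed to 144 LLM tokens.
feats = torch.randn(2, 576, 1024)
print(MLPTokenCompressor()(feats).shape)  # torch.Size([2, 144, 4096])
```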
