Is the current code primarily targeted at front-view images? #16

Open

sunbin1357 opened this issue Dec 25, 2024 · 1 comment

@sunbin1357 commented Dec 25, 2024

Thanks for your great work.

The `senna_nusc_data_converter` part only appends `\n<image>` at the end of the prompt, without the per-view surround prompt mentioned in the paper, `<FRONT VIEW>:\n<image>\n`. Therefore, during training, although it generates surround-view image features, `num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()` is always 1, so only the front-view image features are used. Is the current code primarily targeted at front-view images?
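To make the counting issue concrete, here is a minimal sketch of the logic I mean (LLaVA-style; `IMAGE_TOKEN_INDEX` and the token ids are illustrative, not copied from the repo):

```python
import torch

IMAGE_TOKEN_INDEX = -200  # LLaVA-style placeholder id for <image>

# Illustrative input ids for a prompt that only appends "\n<image>" once:
# a single IMAGE_TOKEN_INDEX appears, even if six camera views were encoded.
cur_input_ids = torch.tensor([101, 2023, 2003, 1037, 3367, IMAGE_TOKEN_INDEX])

num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
print(num_images.item())  # 1 -> only the first (front-view) feature set is spliced in
```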

Additionally, where in the code is the multi-head self-attention part from the figure below?
[figure: architecture diagram attached by the author]

@rb93dett (Collaborator) commented

Hi, the current code supports multi-image input. I have updated the data generation code to include multi-view prompts; this was a bug, thank you for pointing it out.
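For anyone who hits this before pulling the fix, the intended format is one view tag plus one `<image>` placeholder per camera, along the lines of the sketch below (the view names and order here are assumptions; the updated converter has the exact strings):

```python
# Hypothetical sketch of a per-view prompt in the style described in the paper;
# view names/order are assumptions, not copied from the updated converter.
VIEWS = [
    "FRONT VIEW", "FRONT LEFT VIEW", "FRONT RIGHT VIEW",
    "BACK VIEW", "BACK LEFT VIEW", "BACK RIGHT VIEW",
]

def build_multiview_prompt(question: str) -> str:
    # One <image> placeholder per camera, so the image-token count matches
    # the number of encoded views instead of collapsing to 1.
    image_tags = "".join(f"<{view}>:\n<image>\n" for view in VIEWS)
    return image_tags + question

print(build_multiview_prompt("Describe the driving scene."))
```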

Regarding your question: in the paper, we implemented image token compression using a Q-Former. In subsequent experiments, we found that using an MLP yielded similar results and converged faster, so the current code uses an MLP instead of a Q-Former. See here.
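Conceptually, the change swaps a learned-query cross-attention compressor for a plain MLP over grouped tokens. A minimal PyTorch sketch of the MLP idea (the dimensions and grouping factor here are illustrative assumptions, not the exact module in the repo):

```python
import torch
import torch.nn as nn

class MLPTokenCompressor(nn.Module):
    """Sketch: compress image tokens by concatenating the features of
    `group` adjacent tokens and projecting them with a two-layer MLP."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, group: int = 4):
        super().__init__()
        self.group = group
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c = x.shape  # (batch, num_tokens, vision_dim)
        assert t % self.group == 0, "token count must be divisible by group size"
        x = x.reshape(b, t // self.group, c * self.group)
        return self.proj(x)  # (batch, num_tokens // group, llm_dim)

# Example: 576 ViT patch tokens per view compressed to 144 LLM tokens.
feats = torch.randn(2, 576, 1024)
print(MLPTokenCompressor()(feats).shape)  # torch.Size([2, 144, 4096])
```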
