Thanks for your great work.
The senna_nusc_data_converter part only appends \n<image> at the end, without the surround-view prompt mentioned in the paper, <FRONT VIEW>:\n<image>\n. Therefore, during training, although surround-view image_features are generated, num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum() is always 1, so only the front-view image_features are used. Is the current code primarily targeted at front-view images?
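For context, here is a minimal sketch of what I mean (my own illustration, not the converter code; the view names other than FRONT VIEW and the IMAGE_TOKEN_INDEX value are assumptions):

```python
# Minimal sketch, not the actual senna_nusc_data_converter code; view names other
# than FRONT VIEW and the IMAGE_TOKEN_INDEX value are illustrative assumptions.
IMAGE_TOKEN = "<image>"
IMAGE_TOKEN_INDEX = -200  # placeholder id that each <image> maps to after tokenization

# What the converter currently produces: one <image> appended at the end.
current_prompt = "Describe the driving scene." + "\n" + IMAGE_TOKEN

# Paper-style surround-view prompt: one <image> per camera view.
views = ["FRONT VIEW", "FRONT LEFT VIEW", "FRONT RIGHT VIEW",
         "BACK VIEW", "BACK LEFT VIEW", "BACK RIGHT VIEW"]
multi_view_prompt = "".join(f"<{v}>:\n{IMAGE_TOKEN}\n" for v in views)

# During training, the number of image feature groups spliced into the sequence is
#   num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
# With current_prompt this is 1, so only the front-view features are used,
# whereas multi_view_prompt would yield num_images == 6.
```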
Additionally, where in the code is the multi-head self-attention part from the figure below?
Hi, the current code is for multi-image input. I have updated the data generation code to include multi-view prompts; this was a bug, thank you for pointing it out.
Regarding your question, in the paper we implemented image token compression using a Q-Former. In subsequent experiments, we found that using an MLP yielded similar results and converged faster. Therefore, the current code uses an MLP instead of a Q-Former. See here.
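To illustrate the difference, here is a minimal sketch of the two projector options (illustrative only, not the implementation in this repo; class names, dimensions, and the query count are assumptions):

```python
# Illustrative sketch, not the Senna code: a Q-Former compresses N patch tokens to a
# fixed number of learned query tokens via cross-attention, while an MLP simply maps
# each patch token to the LLM hidden size.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Per-token MLP projection (the option used in the current code)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(image_feats)

class QFormerProjector(nn.Module):
    """Fixed number of learned queries attending to the patch tokens (paper variant)."""
    def __init__(self, vision_dim: int, llm_dim: int, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.proj_in = nn.Linear(vision_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) -> (batch, num_queries, llm_dim)
        kv = self.proj_in(image_feats)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)
        return out
```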