What is the cross-view aware Transformer? #79

Open
axbycc-mark opened this issue Nov 23, 2024 · 1 comment

@axbycc-mark
Hi mvsplat team. I have been reading through your paper (https://arxiv.org/abs/2403.14627) and I'm looking for more info about the cross-view Transformer blocks.

> To construct the cost volumes, we first extract multi-view image features with a CNN and Transformer architecture. Specifically, a shallow ResNet-like CNN is first used to extract 4× downsampled per-view image features. Then, we use a multi-view Transformer with self and cross-attention layers to exchange information between different views.

Skimming through the code, it looks like this means you take each feature image (H, W, C) and turn it into a sequence of tokens by a simple reshape into (H*W, C) and then use a transformer model on these tokens, right? And the only other special thing that is happening is the generation of the shifted window masks for the "swin" style attention?

Thanks for the work.
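
For readers following along, the reshape-then-attend idea in the question can be sketched roughly as below. This is a minimal NumPy illustration, not the mvsplat, GMFlow, or UniMatch code: the function names (`flatten_to_tokens`, `attention`, `cross_view_block`) are hypothetical, and the learned projections, multiple heads, layer norms, and feed-forward layers of a real Transformer block are deliberately left out.

```python
import numpy as np


def flatten_to_tokens(feature):
    """Reshape an (H, W, C) feature map into an (H*W, C) token sequence."""
    h, w, c = feature.shape
    return feature.reshape(h * w, c)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_tokens, source_tokens):
    """Single-head scaled dot-product attention.

    Assumption for brevity: queries, keys and values are the raw tokens,
    with no learned projection matrices.
    """
    c = query_tokens.shape[-1]
    scores = query_tokens @ source_tokens.T / np.sqrt(c)  # (Nq, Ns)
    return softmax(scores, axis=-1) @ source_tokens       # (Nq, C)


def cross_view_block(tokens_a, tokens_b):
    """One self-attention step per view, then one cross-attention step."""
    # Self-attention: each view attends to its own tokens (with residual).
    a = tokens_a + attention(tokens_a, tokens_a)
    b = tokens_b + attention(tokens_b, tokens_b)
    # Cross-attention: each view queries the other view's tokens.
    a_out = a + attention(a, b)
    b_out = b + attention(b, a)
    return a_out, b_out


rng = np.random.default_rng(0)
H, W, C = 4, 6, 8
view_a = rng.standard_normal((H, W, C))
view_b = rng.standard_normal((H, W, C))

tok_a = flatten_to_tokens(view_a)
tok_b = flatten_to_tokens(view_b)
out_a, out_b = cross_view_block(tok_a, tok_b)

# Tokens map back to a feature image with the inverse reshape.
feat_a = out_a.reshape(H, W, C)
```

With row-major reshaping, the token at index `y * W + x` corresponds to pixel `(y, x)`, so the output tokens can be reshaped back into an (H, W, C) feature map without any bookkeeping.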

@donydchen
Owner

Hi @axbycc-mark, your understanding is correct. But note that the multi-view Transformer is not our contribution. We mainly adopted the architecture from a well-explored multi-view backbone introduced in GMFlow and UniMatch. Kindly refer to those two papers for related motivations and more implementation details. Thanks.
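
The shifted-window masks mentioned in the question can be illustrated with a generic Swin-style sketch. The code below is an assumption-laden NumPy illustration written for this thread, not taken from mvsplat, GMFlow, or UniMatch; `window_partition`, `shifted_window_mask`, and the `neg` fill value are hypothetical names and choices. It assumes H and W are divisible by the window size and that `0 < shift < window`.

```python
import numpy as np


def window_partition(x, window):
    """Split an (H, W) map into rows of window*window entries, one per window."""
    h, w = x.shape
    x = x.reshape(h // window, window, w // window, window)
    return x.transpose(0, 2, 1, 3).reshape(-1, window * window)


def shifted_window_mask(h, w, window, shift, neg=-100.0):
    """Additive attention mask for windows taken after a cyclic shift.

    Pixels are labelled by the region they belong to after the shift.
    Inside each window, pairs of tokens with different labels get `neg`
    added to their attention score; same-label pairs get 0.
    Assumes 0 < shift < window and h, w divisible by window.
    """
    region = np.zeros((h, w), dtype=np.int64)
    bands = (slice(0, -window), slice(-window, -shift), slice(-shift, None))
    label = 0
    for rows in bands:
        for cols in bands:
            region[rows, cols] = label
            label += 1
    ids = window_partition(region, window)      # (num_windows, window*window)
    diff = ids[:, None, :] - ids[:, :, None]    # pairwise label difference
    return np.where(diff != 0, neg, 0.0)        # (num_windows, N, N)


mask = shifted_window_mask(h=8, w=8, window=4, shift=2)
```

The mask would be added to the attention scores before the softmax, so tokens that only became window neighbours because of the cyclic shift effectively ignore each other.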
