I hope this message finds you well. I've been exploring the implementation of the CLIP model based on the paper 'Learning Transferable Visual Models From Natural Language Supervision'. In my review of the pseudocode provided in the paper, I noticed that the 'texts_loss' and 'images_loss' are calculated using binary matrices as targets, with values of 0 and 1. However, I observed that in the code available at this repository: https://github.com/moein-shariatnia/OpenAI-CLIP/blob/master/CLIP.py , the targets are computed by taking the softmax over the average of text similarity and image similarity.
I wanted to inquire about the rationale behind this difference in target calculation between the pseudocode in the paper and the implementation in your code. Could you kindly shed some light on why the targets are derived from the average of text and image similarities, followed by a softmax operation in the code?
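To make the comparison concrete, here is a minimal NumPy sketch of the two target constructions as I understand them. It is not the repository's code: the function names, the `scale` and `temperature` parameters, and the random embeddings are all my own assumptions for illustration. The paper's integer labels are written here as an identity matrix, which gives the same cross-entropy.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def soft_cross_entropy(logits, targets):
    # Per-row cross-entropy where each target row is a probability distribution.
    return -(targets * log_softmax(logits, axis=-1)).sum(axis=-1)

def hard_targets(batch_size):
    # Paper-style targets: 1 for the matching pair, 0 elsewhere.
    return np.eye(batch_size)

def soft_targets(image_emb, text_emb, scale=1.0):
    # Softmax over the average of image-image and text-text similarity.
    # `scale` is a hypothetical knob, not taken from the repository.
    images_similarity = image_emb @ image_emb.T
    texts_similarity = text_emb @ text_emb.T
    return softmax((images_similarity + texts_similarity) / 2 * scale, axis=-1)

def symmetric_loss(image_emb, text_emb, targets, temperature=1.0):
    # Average of the text-to-image and image-to-text cross-entropies.
    logits = text_emb @ image_emb.T / temperature
    texts_loss = soft_cross_entropy(logits, targets)
    images_loss = soft_cross_entropy(logits.T, targets.T)
    return ((texts_loss + images_loss) / 2).mean()

# Toy usage with random embeddings (assumed shapes: batch of 4, dimension 8).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))
print(symmetric_loss(image_emb, text_emb, hard_targets(4)))
print(symmetric_loss(image_emb, text_emb, soft_targets(image_emb, text_emb)))
```

The only difference between the two calls is the target matrix, which is the part my question is about.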
I greatly appreciate your insights and understanding on this matter. I'm striving to grasp a better understanding of the implementation, and any clarification or references you could provide would be immensely helpful.
Thank you for your time and consideration.
I have the same question. My guess is that the soft targets let the model avoid fully penalizing a caption that fits both image A and a similar image B. The same applies in the other direction, for an image that fits two similar captions.
My question is: is this loss meant to be used only for fine-tuning CLIP? It seems to me that if the text encoder and image encoder are trained from scratch, the initial softmax will produce targets that are shared across all the images in the batch, which would prevent the model from learning the correct associations.
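Here is a toy NumPy check of that concern. It is only a sketch under my own assumptions: the "collapsed" and "separated" embeddings are hand-made stand-ins for untrained and well-trained encoders, and `soft_targets` is my reading of the target computation, not the repository's code. Whether real freshly initialized encoders behave like the collapsed case depends on their initialization and on the scale of the embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x):
    # Scale each row to unit length.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def soft_targets(image_emb, text_emb):
    # Softmax over the average of image-image and text-text similarity.
    sim = (image_emb @ image_emb.T + text_emb @ text_emb.T) / 2
    return softmax(sim, axis=-1)

rng = np.random.default_rng(0)
batch, dim = 4, 16

# Assumed "untrained" case: every item maps to almost the same unit vector.
base = rng.normal(size=(1, dim))
collapsed_img = l2_normalize(base + 1e-3 * rng.normal(size=(batch, dim)))
collapsed_txt = l2_normalize(base + 1e-3 * rng.normal(size=(batch, dim)))
collapsed_targets = soft_targets(collapsed_img, collapsed_txt)

# Assumed "well separated" case: orthogonal embeddings with a large norm.
separated = 10.0 * np.eye(batch, dim)
separated_targets = soft_targets(separated, separated)

print(np.round(collapsed_targets, 3))  # rows are close to uniform
print(np.round(separated_targets, 3))  # rows are close to one-hot
```

In the collapsed case every row of the target matrix is nearly uniform, so the targets carry almost no information about which caption belongs to which image. In the separated case the targets are nearly the identity matrix, which matches the paper's binary labels.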