Added GRU to achieve video consistency #14
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
First of all, your work is amazing!! I just want to make it clear that I absolutely love this result together with matte anything. However, the main problem for real applications of this type of models is the temporal inconsistency. Since you are applying the model image-wise is impossible to achieve such temporal consistency for videos. This pull request is an attempt to include all the features that made RobustVideoMatting temporally consistent, so that you can easily retrain and see if you solve the temporal inconsistency problem.
The main change is the addition of convolutional GRUs in the detail capture mode. To make it possible to reuse already trained models, I add the ConvGRU layers similar to how it was done in controlnet, by initializing at zero and creating residual connections. This way, you can share a hidden state across frames and so the model can achieve temporal consistency. Nevertheless, that is not enough, I have also added another loss function that explicitly guides the model in achieving temporal consistency.
All the code is more or less recycled from the RobustVideoMatting repository. To not break anything I have duplicated the affected files and added a '_video' suffix. The code is supposed to be backward compatible except for working with 5D tensors instead of 4D. I tried to integrate it as much as possible so that you can rapidly try this idea. However, I am aware that the difficult part of managing the data is not included in this pull request. You would need to download RobustVideoMatting dataset and train on there.
I will be more than glad to help with any doubt or contribute further if you give me directions on the hardware you use or the environment. I really want this model to have temporal consistency so that it can be used in real world applications.