Context-aware Synthesis for Video Frame Interpolation

Simon Niklaus and Feng Liu
IEEE Conference on Computer Vision and Pattern Recognition

Video frame interpolation algorithms typically estimate optical flow or its variations and then use it to guide the synthesis of an intermediate frame between two consecutive original frames. To handle challenges like occlusion, bidirectional flow between the two input frames is often estimated and used to warp and blend the input frames. However, how to effectively blend the two warped frames still remains a challenging problem.

This paper presents a context-aware synthesis approach that warps not only the input frames but also their pixel-wise contextual information and uses them to interpolate a high-quality intermediate frame. Specifically, we first use a pre-trained neural network to extract per-pixel contextual information for input frames. We then employ a state-of-the-art optical flow algorithm to estimate bidirectional flow between them and pre-warp both input frames and their context maps. Finally, unlike common approaches that blend the pre-warped frames, our method feeds them and their context maps to a video frame synthesis neural network to produce the interpolated frame in a context-aware fashion.
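The data flow described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's implementation: `toy_context` is a hypothetical stand-in (image gradients) for the pre-trained feature extractor, the flow fields are taken as given rather than estimated, and `forward_warp` is a simple nearest-neighbor splat chosen for brevity because the text above does not specify the warping operator. The sketch stops at assembling the tensor that a synthesis network would consume.

```python
import numpy as np

def toy_context(frame):
    """Hypothetical stand-in for the pre-trained context extractor.
    Returns per-pixel horizontal and vertical gradients, shape (H, W, 2)."""
    gray = frame.mean(axis=2)
    gy, gx = np.gradient(gray)
    return np.stack([gx, gy], axis=2)

def forward_warp(image, flow, t=0.5):
    """Nearest-neighbor forward splat of `image` along t * flow (assumed operator).
    flow[y, x] = (dx, dy) in pixels. On collisions the last write wins;
    pixels that receive no source value stay zero."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + t * flow[..., 0]).astype(int)
    ty = np.rint(ys + t * flow[..., 1]).astype(int)
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    out = np.zeros_like(image)
    out[ty[valid], tx[valid]] = image[ys[valid], xs[valid]]
    return out

def synthesis_input(frame0, frame1, flow01, flow10, t=0.5):
    """Warp each frame together with its context map toward time t and
    stack the results channel-wise as input for a synthesis network."""
    warped = []
    for frame, flow, scale in ((frame0, flow01, t), (frame1, flow10, 1.0 - t)):
        # Color and context are concatenated so that both are warped identically.
        stacked = np.concatenate([frame, toy_context(frame)], axis=2)
        warped.append(forward_warp(stacked, flow, scale))
    return np.concatenate(warped, axis=2)
```

With three color channels and two context channels per frame, the result has ten channels. The key point is that the context maps travel through the same warp as the pixels, so the synthesis network sees aligned color and context instead of having to blend two warped color images directly.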

Our neural network is fully convolutional and is trained end to end. Our experiments show that our method can handle challenging scenarios such as occlusion and large motion and outperforms representative state-of-the-art approaches.

No paper is perfect, and it is important to be upfront about issues once they become apparent. As such, I would like to take the opportunity to mention some of them for this paper below.

  • The idea of extending color with features is simple yet powerful. It is thus not surprising that this technique has since been reinvented for other tasks such as neural rendering by Aliev et al., Bui et al., and Meshry et al. However, they were not aware of our work and hence do not attribute it. This may be due to naming: in hindsight, the term feature-based rendering would have been more appropriate.
  • The concurrent Super SloMo paper exemplifies how important naming is. The term slomo is something that everyone can easily grasp, whereas video frame interpolation is not. As such, the Super SloMo paper did much better in the media even though it performed less favorably quantitatively on the relevant Middlebury benchmark.

Please do not hesitate to reach out to me with questions and suggestions.