Video frame interpolation typically involves two steps: motion estimation and pixel synthesis. Such a two-step approach heavily depends on the quality of motion estimation.
This paper presents a robust video frame interpolation method that combines these two steps into a single process. Specifically, our method considers pixel synthesis for the interpolated frame as local convolution over two input frames. The convolution kernel captures both the local motion between the input frames and the coefficients for pixel synthesis. Our method employs a deep fully convolutional neural network to estimate a spatially-adaptive convolution kernel for each pixel. This deep neural network can be directly trained end to end using widely available video data without any difficult-to-obtain ground-truth data like optical flow.
Our experiments show that the formulation of video interpolation as a single convolution process allows our method to gracefully handle challenges like occlusion, blur, and abrupt brightness change and enables high-quality video frame interpolation.
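To make the formulation concrete, the following is a minimal NumPy sketch of the pixel synthesis step: for every output pixel, a predicted per-pixel kernel is applied to co-located patches from the two input frames. The kernel layout (left half acting on the first frame, right half on the second) and the kernel size are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def adaptive_convolution(frame1, frame2, kernels):
    """Synthesize an interpolated frame via spatially-adaptive convolution.

    frame1, frame2: (H, W) single-channel input frames.
    kernels: (H, W, k, 2*k) per-pixel kernels predicted by some network;
             here we assume the left (k, k) half acts on frame1 and the
             right half on frame2 (a hypothetical layout for this sketch).
    """
    h, w = frame1.shape
    k = kernels.shape[2]
    r = k // 2
    # Pad so that every pixel has a full k-by-k neighborhood.
    p1 = np.pad(frame1, r, mode='edge')
    p2 = np.pad(frame2, r, mode='edge')
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            patch1 = p1[y:y + k, x:x + k]
            patch2 = p2[y:y + k, x:x + k]
            kern = kernels[y, x]  # (k, 2*k)
            # One local convolution jointly covers motion and synthesis.
            out[y, x] = np.sum(kern[:, :k] * patch1) + np.sum(kern[:, k:] * patch2)
    return out
```

Note how the kernel subsumes both steps of the classical pipeline: shifting the kernel's mass off-center encodes local motion, while the coefficient values encode the blending for pixel synthesis.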
No paper is perfect, and it is important to be upfront about issues once they become apparent. As such, I would like to take this opportunity to mention a few for this paper below.
- We were not aware of the Filter Flow paper by Seitz and Baker at the time, but it would have been important to discuss it in the related work section.
- We pad the input images before feeding them into the network to make sure that the output size is not affected by the adaptive convolutions. However, it is perfectly fine to pad the input images only when applying the adaptive convolutions, which is more efficient and leads to slightly better results.
- We would also like to point to Dynamic Filter Networks, concurrent work that, like our proposed approach, predicts spatially varying filter kernels.
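The padding remark above can be made concrete by comparing the spatial extent the network has to process under the two strategies. The frame size and kernel size below are hypothetical; the point is only that padding at the adaptive-convolution step, rather than at the network input, shrinks the network's workload while keeping the output size unchanged.

```python
K = 41          # hypothetical adaptive kernel size
H, W = 128, 128 # hypothetical frame size

# Variant used in the paper: pad the frames before the network, so that a
# 'valid' adaptive convolution on the network's output still yields H x W.
net_input_paper = (H + K - 1, W + K - 1)

# Variant from the erratum: feed the original frames to the network and
# pad only when applying the adaptive convolutions.
net_input_erratum = (H, W)

ratio = (net_input_paper[0] * net_input_paper[1]) / (H * W)
print(net_input_paper, net_input_erratum, round(ratio, 2))
```

For these illustrative sizes, the padded variant makes the network process roughly 1.7 times as many pixels, which is why padding only at the adaptive convolution is the more efficient choice.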
Please do not hesitate to reach out to me with questions and suggestions.