Softmax Splatting for Video Frame Interpolation
Simon Niklaus and Feng Liu
IEEE Conference on Computer Vision and Pattern Recognition
Differentiable image sampling in the form of backward warping has seen broad adoption in tasks like depth estimation and optical flow prediction. In contrast, how to perform forward warping has seen less attention, partly due to additional challenges such as resolving the conflict of mapping multiple pixels to the same target location in a differentiable way.
We propose softmax splatting to address this paradigm shift and show its effectiveness on the application of frame interpolation. Specifically, given two input frames, we forward-warp the frames and their feature pyramid representations based on an optical flow estimate using softmax splatting. In doing so, the softmax splatting seamlessly handles cases where multiple source pixels map to the same target location. We then use a synthesis network to predict the interpolation result from the warped representations. Our softmax splatting allows us to not only interpolate frames at an arbitrary time but also to fine tune the feature pyramid and optical flow.
We show that our frame synthesis approach, empowered by softmax splatting, achieves new state-of-the-art results for video frame interpolation.
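To make the collision handling described above concrete, here is a minimal 1-D sketch, not the paper's implementation. It assumes nearest-neighbor rounding of the target position for brevity, whereas a real splatting operator would distribute each source pixel over neighboring targets. The function name `softmax_splat_1d` and the importance values `z` are hypothetical and chosen only for illustration.

```python
import math

def softmax_splat_1d(values, flow, z, width):
    """Forward-warp a 1-D signal; pixels that collide are blended with exp(z) weights."""
    num = [0.0] * width  # accumulated exp(z) * value per target location
    den = [0.0] * width  # accumulated exp(z) per target location
    for x, (v, f, zi) in enumerate(zip(values, flow, z)):
        target = int(round(x + f))  # assumption: nearest-neighbor, not bilinear
        if 0 <= target < width:
            w = math.exp(zi)
            num[target] += w * v
            den[target] += w
    # Locations that received no source pixel stay at 0.0 (holes).
    return [n / d if d > 0.0 else 0.0 for n, d in zip(num, den)]

values = [10.0, 20.0, 30.0]
flow = [2.0, 1.0, 0.0]   # all three source pixels map to x = 2
z = [0.0, 0.0, 5.0]      # the third pixel is given much higher importance
out = softmax_splat_1d(values, flow, z, width=3)
print(out)
```

Because the third pixel has by far the largest `z`, the blended value at `x = 2` ends up close to 30, while the two locations that receive no pixel remain empty.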
No paper is perfect, and it is important to be upfront about issues once they become apparent. As such, I would like to take the opportunity to mention some issues with this paper below.
- The paper stated "-sign" in Equation 10 but should have stated "sign" instead. We would like to thank Sen Zhang for kindly pointing this out. Fortunately, our implementation is/was correct so our story remains unchanged.
- It is well known that splatting is susceptible to slight blurriness. We failed to discuss this in our paper but should have done so. However, our approach of warping features and then using a synthesis network to obtain the interpolation result makes it possible to reintroduce high-frequency components.
- One could try to make linear splatting translation invariant by subtracting the minimum of Z from Z before using it. While this approach works in principle, it makes the splatting result depend on the minimum of Z, which may lead to inconsistent results if Z is noisy.
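The last point can be illustrated with a small sketch for two pixels that land on the same target location. It is a toy example under stated assumptions, not the paper's code: the helper names are hypothetical, linear splatting is taken to weight each pixel by Z and softmax splatting by exp(Z), and the minimum is assumed to be taken over the whole Z map.

```python
import math

def splat(values, z, weight_fn):
    """Blend pixels that all land on one target location using weight_fn(z)."""
    weights = [weight_fn(zi) for zi in z]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def linear_min_subtracted(values, z, z_map_min):
    """Linear splatting after subtracting the minimum of the whole Z map (assumed global)."""
    return splat(values, z, lambda t: t - z_map_min)

values = [10.0, 30.0]
z = [1.0, 3.0]
shifted = [zi + 10.0 for zi in z]  # same Z with a constant offset

# Linear weights change with the offset; softmax weights do not.
linear = splat(values, z, lambda t: t)
linear_shifted = splat(values, shifted, lambda t: t)
softmax = splat(values, z, math.exp)
softmax_shifted = splat(values, shifted, math.exp)

# Subtracting the minimum removes the offset, but ties the result to that minimum:
# a single noisy value elsewhere in the Z map changes the blend at this location.
clean = linear_min_subtracted(values, z, z_map_min=min(z))
noisy = linear_min_subtracted(values, z, z_map_min=-5.0)

print(linear, linear_shifted)
print(softmax, softmax_shifted)
print(clean, noisy)
```

In this example, plain linear splatting gives different results for `z` and `shifted`, softmax splatting gives the same result for both, and the min-subtracted variant changes once the map minimum drops to the hypothetical outlier value of -5.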
Please do not hesitate to reach out to me with questions and suggestions.