The main idea of this paper is that 'motion-compensated frame interpolation can be implemented by a simple convolution operation'. The following image briefly illustrates the authors' idea.
The goal of this paper is thus to train a fully convolutional neural network that infers a convolution kernel from two patches taken from consecutive video frames.
Applying the convolution kernel (inferred by the trained deep network) to the patches P1 and P2 performs motion estimation and frame interpolation simultaneously, as in the sketch below.
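A minimal sketch of this per-pixel synthesis step (the function name and exact shapes are my assumptions; consistent with the 41-pixel motion limit noted below, each patch is taken as 41x41 and the inferred kernel as 41x82, one half per patch):

```python
import numpy as np

def interpolate_pixel(P1, P2, kernel):
    # Hypothetical sketch: P1 and P2 are 41x41 patches around the same
    # location in the two input frames; `kernel` is the kernel inferred
    # by the network for this output pixel.
    K1, K2 = kernel[:, :41], kernel[:, 41:]  # one 41x41 half per patch
    # A single weighted sum over both patches carries out motion
    # compensation and interpolation at once.
    return np.sum(K1 * P1) + np.sum(K2 * P2)
```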
The authors proposed a convolutional neural network architecture that makes use of Batch Normalization and ReLU; a building block in that spirit is sketched below.
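For illustration only (the layer counts, channel widths, and kernel size here are placeholder assumptions, not the paper's actual configuration), such a block could look like:

```python
import torch.nn as nn

def conv_block(in_channels, out_channels):
    # One convolutional stage: convolution -> Batch Normalization -> ReLU.
    # The 3x3 kernel and channel counts are illustrative assumptions.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )
```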
They also proposed a loss that uses the absolute pixel difference (between the ground-truth pixel value and the pixel value interpolated by the inferred convolution kernel) as well as the absolute gradient difference.
Incorporating the absolute gradient difference into the overall loss makes the interpolated frame sharper; a sketch of such a combined loss follows.
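Here is a hedged PyTorch sketch of this combined loss (the weighting factor `lam` and the finite-difference gradients are my assumptions, not the paper's exact formulation):

```python
import torch

def combined_loss(pred, target, lam=1.0):
    # L1 color term: absolute pixel difference to the ground truth.
    color = torch.abs(pred - target).mean()
    # L1 gradient term: absolute difference of finite-difference
    # image gradients along x and y.
    gx = lambda im: im[..., :, 1:] - im[..., :, :-1]
    gy = lambda im: im[..., 1:, :] - im[..., :-1, :]
    grad = (torch.abs(gx(pred) - gx(target)).mean() +
            torch.abs(gy(pred) - gy(target)).mean())
    # `lam` balances sharpness against color fidelity (assumed value).
    return color + lam * grad
```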
The authors also suggested a shift-and-stitch implementation to significantly decrease computation time: rather than running the network once per output pixel, it is run on a small number of shifted copies of the input, and the coarse outputs are interleaved into a dense result, as sketched below.
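An illustrative simplification of the shift-and-stitch idea (not the authors' exact code; `net` is assumed to map an HxW input to an output downsampled by `stride`, and H and W are assumed divisible by `stride`):

```python
import numpy as np

def shift_and_stitch(net, x, stride):
    # Run the network on stride**2 shifted copies of the input and
    # interleave the coarse outputs into one dense, per-pixel result.
    # This takes stride**2 forward passes instead of one per pixel.
    H, W = x.shape
    dense = np.zeros((H, W))
    for dy in range(stride):
        for dx in range(stride):
            # Shift the input by (dy, dx) and zero-pad back to HxW.
            shifted = np.pad(x[dy:, dx:], ((0, dy), (0, dx)))
            dense[dy::stride, dx::stride] = net(shifted)  # (H//stride, W//stride)
    return dense
```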
The limitations of this work:
1) It cannot deal with videos containing very large motion (motion magnitude greater than 41 pixels, a bound set by the fixed kernel size).
2) It only produces a single frame at time t+1/2 between the t-th and (t+1)-th video frames.