5.1 Introduction
5.2.7 Other Work
Temporal interpolation of signals is not new. It has long been done for 1D signals in signal processing, but these methods cannot be applied to our problem as we have motion to consider. If we compute the flow of a sequence, it should in principle be possible to take a simple average along the backward and forward flows to create new frames. This is, however, an oversimplification of the problem, as we do not know what comes first in a new frame: the flow or the intensities? To solve this chicken-and-egg problem optimally, one has to iterate between estimating the flow and the intensities, as we suggested above. This idea of simultaneous multiresolution flow and intensity calculation has already proven useful for motion compensated inpainting in [65]. Let us have a look at what others have done before us in the field of temporal super resolution.
In medical imaging, interpolation of new frames or volumes in a time sequence of 2D or 3D scans is of interest, mainly in lung (respiratory gated) and heart (heart gated) imaging. The work by Ehrhardt et al. in [35] is a typical and recent example, where temporal super resolution in heart gated imaging is performed using an accurate flow algorithm, but with only simple motion compensated interpolation of intensities along the flow lines to get the new frames. In our own field of video processing there are several TSR patents, like the one by Cornog et al. [24] and the already mentioned [85], where the same procedure as in [35] is used: flow calculation (good or bad) followed by some non-iterative averaging along the flow. TSR is also done in integrated circuits (ICs) as described by de Haan in [28], using 8 × 8 block matching flow with a median filter for motion compensated interpolation (details on the intensity interpolation are given in [78] by Ojo and de Haan). In a recent paper [27], Dane and Nguyen present motion compensated interpolation with adaptive weighting to minimize the error from imprecise or unreliable flow, which is surely needed as the flow used in [27] is the MPEG coding vectors (typically prediction error minimizing vectors). The advantage of using MPEG vectors as flow is that one avoids the computationally expensive flow calculation.
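To make the notion of block matching flow concrete, the following is a minimal sketch of exhaustive-search block matching on 8 × 8 blocks. It is our own illustration and not the algorithm of [28]: the function name `block_matching_flow`, the sum of absolute differences (SAD) cost, the search range of ±4 pixels and the restriction to integer vectors are all assumptions made for the example.

```python
import numpy as np

def block_matching_flow(prev, curr, block=8, search=4):
    """Estimate one integer motion vector per block by exhaustive SAD search.

    For each block of `curr` with top-left corner (y, x), the returned
    vector (dy, dx) points to the best matching block in `prev`, i.e.
    prev[y + dy : y + dy + block, x + dx : x + dx + block].
    The SAD cost and the search range are illustrative assumptions.
    """
    prev = np.asarray(prev, dtype=np.float64)
    curr = np.asarray(curr, dtype=np.float64)
    h, w = curr.shape
    rows, cols = h // block, w // block
    flow = np.zeros((rows, cols, 2), dtype=int)
    for by in range(rows):
        for bx in range(cols):
            y, x = by * block, bx * block
            target = curr[y:y + block, x:x + block]
            best_cost = np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    # Skip candidates that fall outside the previous frame.
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    candidate = prev[yy:yy + block, xx:xx + block]
                    cost = np.abs(target - candidate).sum()
                    if cost < best_cost:
                        best_cost = cost
                        flow[by, bx] = (dy, dx)
    return flow
```

Such a search yields one vector per block, which is why block based flows are coarse compared with dense per-pixel flows, and why some robust filtering of the interpolated intensities is typically needed afterwards.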
In [52] Karim et al. focus on improving block matching flow estimation for motion compensated interpolation in low frame rate video, and no fewer than 16 references to other TSR algorithms are given. An overview of early work on motion compensated temporal interpolation in general (TSR, coding, deinterlacing etc.) is given by Dubois and Konrad in [33], where they state that even though motion trajectories are often nonlinear, accelerated and complex, a simple linear flow model will suffice in many cases. In [17] Chahine and Konrad show that modelling the acceleration will improve results in motion compensated TSR when measuring the peak signal-to-noise ratio (PSNR) between the results and the ground truth. The improved flow modelling will make interpolation (and prediction) coding better in terms of quality to bandwidth ratio. We are not necessarily interested in optimizing an objective error measure like the PSNR, but are more focussed on pleasing the human viewer. Variational optical flow algorithms can model and calculate acceleration as a consequence of temporal regularization. An example of this is the flow of the sequence Ettlinger Tor computed by Brox et al. in [9], but in interpolating flow and intensities in all new frames, a straight motion trajectory is likely to be chosen as the flow of minimum energy (variational optical flows are typically found by minimizing the energy of the model applied to the image sequence and its optical flow). Specifically modelling acceleration in the flow will give a smoother flow trajectory in TSR. Whether this increased smoothness of the flow will improve the quality of TSR is doubtful, as the human visual system itself does linear interpolation.
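For reference, the PSNR between an interpolated frame and its ground truth can be computed as in the short sketch below. The function name `psnr` and the default peak value of 255 (8-bit intensities) are assumptions made for the example; the evaluation setup actually used in [17] is not reproduced here.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between a ground truth frame and an estimate.

    `peak` is the maximum possible intensity; 255 assumes 8-bit data.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

Since the measure depends only on the mean squared error, a result with a slightly misplaced but sharp object can score worse than a blurred one, which is one reason why PSNR does not necessarily agree with what pleases the human viewer.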
A problem somewhat more complex than our new frame interpolation problem is that of creating a 2D sequence from a new, arbitrary viewpoint given a multicamera recording of a scene, as done by Vedula et al. in [105]. The 3D registration and scene modelling (similar to structure from motion problems) is what complicates matters, but the multicamera recording does at the same time provide an abundance of available information. Leaving out all the 3+1D shape and flow modelling, the 2+1D TSR part used in [105] is the classic Lucas-Kanade flow estimation [71] followed by simple motion compensated intensity interpolation, weighting the forward and backward known frame neighbors by their linear distance in time. No optical flow estimation is performed by Shechtman et al. in [92]. In their version of the multiple camera approach, all the cameras are assumed to be close spatially, or the scene is assumed planar, to allow the different input sequences to be registered by simple alignment to a common frame of reference. From the multiple inputs a high resolution output in either space or time (it is a tradeoff) is produced. Shechtman et al. therefore already have a very dense recording of a registered scene, making it very easy (according to the authors) to produce high frame rate TSR outputs without flow estimation (e.g. 75 fps from four 25 fps sequences). This technique cannot be used on single camera recordings, and the viewpoint can no longer be chosen arbitrarily as it could in [105].
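The simple motion compensated intensity interpolation with linear weighting in time, as used in the 2+1D part of [105] and in several of the methods above, can be sketched as follows. This is our own minimal illustration and not the implementation of [105]: we assume a flow field given on the pixel grid of the new frame, constant velocity between the two known frames, and nearest-neighbor sampling; the name `mc_interpolate` and the array layout are hypothetical.

```python
import numpy as np

def mc_interpolate(frame0, frame1, flow, t):
    """Interpolate a frame at time t in (0, 1) between frame0 and frame1.

    flow[..., 0] and flow[..., 1] hold the vertical and horizontal
    displacement per frame interval, defined on the grid of the new
    frame (an assumption of this sketch). Intensities are fetched along
    the flow from both neighbors and weighted by their distance in time.
    """
    h, w = frame0.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the flow backward in time to frame0 (nearest-neighbor sampling).
    y0 = np.clip(np.rint(ys - t * flow[..., 0]), 0, h - 1).astype(int)
    x0 = np.clip(np.rint(xs - t * flow[..., 1]), 0, w - 1).astype(int)
    # Follow the flow forward in time to frame1.
    y1 = np.clip(np.rint(ys + (1 - t) * flow[..., 0]), 0, h - 1).astype(int)
    x1 = np.clip(np.rint(xs + (1 - t) * flow[..., 1]), 0, w - 1).astype(int)
    # The temporally nearer frame gets the larger weight.
    return (1 - t) * frame0[y0, x0] + t * frame1[y1, x1]
```

Note that the flow is taken as given here and never revised, which is exactly the non-iterative scheme discussed above: any error in the flow is passed directly on to the interpolated intensities.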
Using patches to represent salient image information is well known (see e.g. the papers by Freeman et al. [40] and Griffin and Lillholm [44]), and an extension to spatiotemporal image sequences under the name of ‘video epitomes’ is presented and used for TSR by Cheung et al. in [21]. The framework for video epitomes can, just like our Bayesian inference based framework, be used in general for image sequence inpainting, upscaling and de-noising. In the case of TSR, [21] does not discuss at all to what degree video epitome TSR can handle motion, and in the example given on generating frames dropped unevenly in an Internet broadcast there is only very little motion. Furthermore, video epitomes need to be learned on either a better representation of the degraded data (e.g. high resolution sequences) or on the input data available. Learning is a computationally very costly process (no details are given in [21]), and it is therefore unclear whether video epitome TSR can be applied to general video at all (and at what processing cost in time and hardware), but video epitomes are as such an interesting technology.