Inter Prediction: skip, MV, Affine, new VVC tech

Inter prediction (or temporal prediction) is a prediction process that uses other frames as references to predict the samples of the current frame. Reference frames may come before or after the current frame in playback order, but they must already have been processed in decode order.

The main idea of inter prediction is that a block of samples in the current frame can be copied from a similar block in another frame, displaced by an offset called the motion vector (MV). This rests on the assumption that objects usually stay visible for some time in a video sequence, though they may move.

The encoder performs motion estimation and signals the motion vector in the bitstream. The decoder uses the motion vector as an instruction for applying inter prediction.
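The encoder/decoder split described above can be illustrated with a minimal sketch. It assumes integer-sample motion vectors, a brute-force full search, and the sum of absolute differences (SAD) as the matching cost; real encoders use faster searches, sub-sample interpolation, and rate-aware costs. All function names and the toy frames are hypothetical.

```python
def get_block(frame, x, y, size):
    """Return the size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def motion_estimate(cur, ref, x, y, size, search_range):
    """Encoder side: full search for the integer MV with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    target = get_block(cur, x, y, size)
    best_mv, best_cost = (0, 0), sad(target, get_block(ref, x, y, size))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            # Skip candidates that point outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(target, get_block(ref, rx, ry, size))
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def motion_compensate(ref, x, y, size, mv):
    """Decoder side: copy the reference block displaced by the MV."""
    return get_block(ref, x + mv[0], y + mv[1], size)


# Toy data (assumption): a 16x16 gradient whose content moves 2 right, 1 down.
ref = [[x + 16 * y for x in range(16)] for y in range(16)]
cur = [[ref[max(y - 1, 0)][max(x - 2, 0)] for x in range(16)] for y in range(16)]

mv, cost = motion_estimate(cur, ref, x=4, y=4, size=4, search_range=3)
pred = motion_compensate(ref, 4, 4, 4, mv)
```

In this toy setup the content of the current block sits 2 samples to the left and 1 sample up in the reference frame, so the search is expected to settle on the MV (-2, -1) with zero cost, and the compensated block equals the current block.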

The efficiency of video coding algorithms relies heavily on inter prediction because it reduces temporal redundancy.

Motion vectors can be signaled explicitly, but encoding a motion vector for each block can take a significant number of bits. MVs of neighboring partitions are usually highly correlated, so it is possible to perform motion vector prediction (MVP) from nearby, already decoded MVs. Several additional signaling modes reduce the amount of encoded data further: skip, direct, and merge. In these modes, the motion information of the current coding block (motion vector, inter direction, reference index) is inherited entirely from spatially adjacent or temporally co-located coded blocks. The differences are:

Skip mode – the MV is predicted and no residual is transmitted, so the reconstructed block is identical to the predicted one (present in H.264)

Direct mode – the MV is predicted and the residual is transmitted (present in H.264)

Merge mode – the MV is selected from a list of MV candidates; the candidate index and the residual are transmitted (present in H.265)
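The difference between the three modes listed above comes down to what the decoder reads from the bitstream. The sketch below is a simplification under stated assumptions: the "predicted" MV is taken as the component-wise median of the neighbouring candidates, MVs are integer-sample, and a single reference frame is used. Real codecs apply more elaborate derivation rules. The names `decode_inter_block`, `median_mv`, and `fetch_block` are hypothetical.

```python
def fetch_block(ref, x, y, size, mv):
    """Copy the reference block displaced by the integer MV (dx, dy)."""
    x0, y0 = x + mv[0], y + mv[1]
    return [row[x0:x0 + size] for row in ref[y0:y0 + size]]


def median_mv(candidates):
    """Component-wise median of the neighbouring MVs (assumed predictor)."""
    xs = sorted(mv[0] for mv in candidates)
    ys = sorted(mv[1] for mv in candidates)
    mid = len(candidates) // 2
    return (xs[mid], ys[mid])


def decode_inter_block(mode, ref, x, y, size, candidates,
                       merge_index=None, residual=None):
    """Reconstruct one block; only merge_index and residual are 'signaled'."""
    if mode in ("skip", "direct"):
        mv = median_mv(candidates)        # MV is derived, nothing is signaled
    elif mode == "merge":
        mv = candidates[merge_index]      # only the candidate index is signaled
    else:
        raise ValueError(f"unknown mode: {mode}")

    pred = fetch_block(ref, x, y, size, mv)
    if mode == "skip":
        return pred                       # no residual: output equals prediction
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(pred, residual)]


# Toy data (assumption): an 8x8 gradient frame and three neighbouring MVs.
ref = [[x + 10 * y for x in range(8)] for y in range(8)]
neighbours = [(1, 0), (2, 1), (0, 1)]

skipped = decode_inter_block("skip", ref, 2, 2, 2, neighbours)
merged = decode_inter_block("merge", ref, 2, 2, 2, neighbours,
                            merge_index=1, residual=[[0, 0], [0, 0]])
```

Keeping the MV derivation in one place makes the trade-off visible: skip spends no bits on either motion or residual, direct spends bits only on the residual, and merge adds a small index in exchange for a free choice among candidates.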

Here are some additional modes introduced in VVC (H.266):

Geometric Partitioning (GEO) – the block is split into two parts by a straight line with a signaled angle and offset. Each part has its own motion vector.

History-based motion vector prediction (HMVP) – during decoding, the MVs of previously coded blocks are stored in a history list and can later be used as candidates

Subblock prediction – data is signaled for the whole block, but the prediction is performed separately for small subblocks. This includes:

Affine Prediction – represents affine motion such as rotation, zoom, and shear. The motion is described by 2 or 3 control point motion vectors (CPMVs), which are used to calculate an independent MV for each subblock.
Subblock-based Temporal Motion Vector Prediction (SbTMVP) – uses motion information from the collocated picture to derive motion vectors for subblocks.
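To make the affine case above more concrete, here is a minimal sketch of how per-subblock MVs can be derived from the CPMVs. It assumes the usual layout with the first CPMV at the top-left corner, the second at the top-right, and the optional third at the bottom-left, and it evaluates the model at each subblock center. Floating-point arithmetic is used for readability, whereas a real codec works with fixed-point MVs and rounding. The function name is hypothetical.

```python
def affine_subblock_mvs(cpmvs, width, height, sub=4):
    """Derive one MV per sub x sub subblock from 2 or 3 CPMVs.

    cpmvs[0]: MV at the top-left corner
    cpmvs[1]: MV at the top-right corner
    cpmvs[2]: MV at the bottom-left corner (6-parameter model only)
    """
    v0, v1 = cpmvs[0], cpmvs[1]
    a = (v1[0] - v0[0]) / width       # change of mv_x per sample in x
    b = (v1[1] - v0[1]) / width       # change of mv_y per sample in x
    if len(cpmvs) == 3:
        v2 = cpmvs[2]
        c = (v2[0] - v0[0]) / height  # change of mv_x per sample in y
        d = (v2[1] - v0[1]) / height  # change of mv_y per sample in y
    else:
        # 4-parameter model: rotation and zoom only, so the vertical
        # gradients follow from the horizontal ones.
        c, d = -b, a

    field = {}
    for sy in range(0, height, sub):
        for sx in range(0, width, sub):
            cx, cy = sx + sub / 2, sy + sub / 2   # subblock center
            field[(sx, sy)] = (a * cx + c * cy + v0[0],
                               b * cx + d * cy + v0[1])
    return field


# Toy example (assumption): a 16x16 block with a zoom-like motion field.
zoom = affine_subblock_mvs([(0, 0), (16, 0)], width=16, height=16)
```

With two identical CPMVs the model degenerates to plain translation and every subblock receives the same MV, which is a convenient sanity check for the derivation.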

Enhancement modes:

Bi-directional optical flow (BDOF) – sample-wise refinement of the bi-predicted block based on the optical flow equation
Decoder-Side Motion Vector Refinement (DMVR) – the decoder refines the motion of subblocks by minimizing the distortion between the two reference blocks
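The DMVR idea above can be sketched as a small mirrored search. This is a simplified illustration, not the normative VVC process: it assumes integer-sample offsets only, SAD as the distortion measure, and a caller that keeps all accessed blocks inside the frames. The offset added to the first MV is subtracted from the second, so the refinement keeps the two MVs symmetric around their starting point. All names are hypothetical.

```python
def get_block(frame, x, y, size):
    """Return the size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def dmvr_refine(ref0, ref1, x, y, size, mv0, mv1, search_range=2):
    """Refine a bi-prediction MV pair by matching the two reference blocks."""
    def cost(dx, dy):
        # Mirrored offsets: +delta on reference 0, -delta on reference 1.
        b0 = get_block(ref0, x + mv0[0] + dx, y + mv0[1] + dy, size)
        b1 = get_block(ref1, x + mv1[0] - dx, y + mv1[1] - dy, size)
        return sad(b0, b1)

    # Start from the signaled MVs so that ties keep the original motion.
    best_delta, best_cost = (0, 0), cost(0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            c = cost(dx, dy)
            if c < best_cost:
                best_delta, best_cost = (dx, dy), c

    dx, dy = best_delta
    return (mv0[0] + dx, mv0[1] + dy), (mv1[0] - dx, mv1[1] - dy), best_cost


# Toy data (assumption): a texture that moves linearly by (3, 1) per frame,
# so the true MVs are (3, 1) into ref0 and (-3, -1) into ref1.
def texture(x, y):
    return x * x + 2 * y * y + 3 * x * y

ref0 = [[texture(x - 3, y - 1) for x in range(16)] for y in range(16)]
ref1 = [[texture(x + 3, y + 1) for x in range(16)] for y in range(16)]

refined = dmvr_refine(ref0, ref1, x=6, y=6, size=4, mv0=(2, 1), mv1=(-2, -1))
```

Because the search only compares the two reference blocks with each other, it needs no access to the original frame, which is why the decoder can run it without any extra signaling.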