
Introduction to H.264 video coding standard





1. Introduction to the H.264 standard


Work on H.264 began within the Video Coding Experts Group (VCEG) of ITU-T in 1999. The objectives of this standard were:

- to provide efficient compression (roughly a 50% reduction in average bitrate at equivalent visual quality compared to the other existing standards),
- to offer a reasonable trade-off between complexity and coding efficiency,
- to adapt easily to networked applications, in particular wireless networks and the Internet (hence transport over IP),
- to adopt an easy-to-use syntax, with a reduced number of profiles and options.

The main target applications of H.264 are:

- duplex real-time conversational services (e.g. video telephony) over wired or wireless (such as UMTS) networks (bitrate below 1 Mb/s and low latency),
- good- or high-quality video services for streaming over satellite, xDSL, or DVD (bitrate from 1 to 8 Mb/s and possibly large latency),
- lower-quality streaming video services at lower bitrates, such as over the Internet (bitrate below 2 Mb/s and possibly large latency).

The first test model (TML-1) was delivered in August 1999. In December 2001, VCEG and MPEG formed a partnership, the "Joint Video Team" (JVT), to establish a common standard, labelled H.264 or MPEG-4 AVC (Advanced Video Coding). The first drafts were proposed to ITU and ISO in 2002, and the final version of the ITU document (JVT-G050) was established in March 2003.

H.264 specifies only the video coding aspects; transport is handled by the MPEG-4 Systems specifications. Still, as illustrated in Figure 1, H.264 includes, above its coding layer, a flexible network adaptation layer that allows it to interoperate with transport technologies from both the wireless and wired worlds:

- telephone networks, through H.324 (circuit-switched mode) or H.320 (ISDN);
- the IP world, through RTP/UDP/IP or TCP/IP stacks.

This makes H.264 a standard compatible with both mobile and fixed solutions, and it can hence help the unification of these two worlds that has already begun.


Figure 1 - H.264/AVC layer structure.

It should be noted that, while now part of the MPEG-4 standard, H.264 is not directly compatible with the other MPEG-4 video parts, as the compression tools differ greatly. While not completely different from the other existing video standards, the video coding layer of H.264 is characterised by the following new key aspects:

- better motion compensation,
- smaller blocks in the coding transform,
- a new, losslessly invertible coding transform,
- an improved deblocking filter,
- more efficient entropy coding.

This leads to a noticeable bitrate reduction at equal quality compared to the other standards, as illustrated in Table 1.

Test conditions: mix CIF/QCIF (4 QCIF sequences at 10 and 15 fr/s
and 4 CIF sequences at 15 and 30 fr/s); reasonable motion, no B frames.

Average bitrate gain of H.264/AVC when compared to:

  MPEG-4 ASP (Advanced Simple Profile)           28%
  H.263 CHC (Conversational High Compression)    32%
  MPEG-4 SP (Simple Profile)                     34%
  H.263 Baseline                                 45%

Table 1 - Average gain in bitrate with H.264 over other standards and modes (source: JVT).

It should however be noted that this gain comes at a cost: an H.264 Baseline encoder (resp. decoder) is estimated to be four to five times (resp. two to three times) more complex than an MPEG-2 encoder (resp. decoder).

H.264 defines three profiles:

- Baseline, particularly suited to videoconferencing, video over IP and mobile applications. It implements only I and P frames, and a few error-resilience tools,
- Main, suited to TV and video broadcasting and to applications tolerating large latency. It implements in particular interlaced coding, B frames and arithmetic entropy coding,
- eXtended, suited to streaming over various channels, in particular wireless ones. It implements in particular adaptive-rate solutions and error-resilience tools.

It should be noted that the eXtended profile is a superset of the Baseline profile, which is not the case for the Main profile, as illustrated by Figure 2, where the main tools implemented by each profile are indicated.


Figure 2 - H.264 profiles.

In terms of evolution and market adoption, it seems that most implementations so far use the Baseline profile. Still considered under development barely more than a year ago, H.264 has since emerged, in particular since IBC 2003 in Amsterdam (Sept. 2003), as a technology to be reckoned with. Several manufacturers and operators interested in the standard demonstrated it last year: BT Research labs presented an H.264 decoder on a Nokia 7650 handset running QCIF at about 20 kbit/s and 14 fr/s; LSI Logic presented a Main@Level 3 encoder combining FPGAs and Pentiums; and a real-time Baseline encoder running on a single Pentium IV PC was shown by Vanguard Software Solutions.

It is undeniable that H.264 is riding the rising wave of video over wireless and streaming over the Internet, and that it will soon be a serious competitor to the existing standards, if patent issues do not slow its rise as they did for MPEG-4.

A drawback of H.264 is that it does not include scalability, except for the temporal scalability (illustrated in Figure 3) offered by the use of Bi-directional (B) frames in the eXtended and Main profiles. This could however change, as the MPEG-21 scalable video coding (SVC) activity is considering, among other solutions, a scalable extension to H.264/AVC.


Figure 3 - Illustration of temporal scalability.

2. Description of the H.264 Video Coding Format


The H.264 standard relies on the global coding scheme presented in Figure 4.


Figure 4 - H.264 global coding scheme.

Let us focus on a few elements of this coding scheme, in particular those that are novel compared to traditional predictive coding standards. First comes the separation into macro-blocks and then blocks: each image is subdivided into 16x16-pixel macro-blocks, which are themselves divided into 4x4-pixel blocks, as illustrated by Figure 5. Then comes the transform used, which is no longer a DCT but an integer transform, an approximation of the DCT with the advantage of being invertible without loss. This transform is always applied to 4x4 blocks and is defined by its matrix H:

    H = | 1  1  1  1 |
        | 2  1 -1 -2 |
        | 1 -1 -1  1 |
        | 1 -2  2 -1 |


Figure 5 - Separation in macro-block/block in H.264 standard.
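The lossless invertibility of the integer transform can be checked directly. The sketch below applies the 2-D core transform Y = H·X·Hᵀ to a block and reconstructs it exactly; note this is a simplification, since the real codec folds the inverse scaling into the quantization step rather than computing a floating-point inverse:

```python
import numpy as np

# 4x4 forward core transform matrix of H.264 (integer approximation of the DCT)
H = np.array([
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
])

def forward_transform(block):
    """2-D integer transform of a 4x4 block: Y = H X H^T (integers only)."""
    return H @ block @ H.T

def inverse_transform(coeffs):
    """Exact mathematical inverse; the standard instead integrates the
    required scaling into (de)quantization to stay in integer arithmetic."""
    Hinv = np.linalg.inv(H.astype(float))
    return Hinv @ coeffs @ Hinv.T

X = np.arange(16).reshape(4, 4)      # an arbitrary 4x4 residual block
X_rec = inverse_transform(forward_transform(X))
assert np.allclose(X, X_rec)          # lossless round trip
```

The rows of H are mutually orthogonal, which is what makes an exact inverse possible with such small integer coefficients.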

Following the H transform comes a uniform scalar quantization. A deblocking filter was also introduced to reduce the blocking artefacts due to the transform and quantization, as illustrated by Figure 6.


Figure 6 - Effect of the deblocking filter in H.264 coding scheme.

Another very important innovation introduced by the JVT team is that even intra blocks use prediction, called Intra prediction, which relies on the adjacent blocks of the Intra image. Two prediction modes are defined, with different precision levels: the 16x16 mode, consequently used at lower bitrates and in areas of the image with fewer details, and the 4x4 mode, used rather in detailed areas and at higher bitrates. The prediction directions proposed by each mode are presented in Figure 7.


Figure 7 - Existing intra prediction modes in H.264 standard.
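To make the idea concrete, here is a sketch of three of the nine 4x4 intra prediction modes (vertical, horizontal, DC), predicting a block from its already-reconstructed neighbours; the function name and rounding convention are illustrative, not taken from the specification:

```python
import numpy as np

def intra_4x4_predict(top, left, mode):
    """Predict a 4x4 block from its neighbours (sketch of 3 of the 9 modes).
    top, left: the 4 reconstructed samples above / to the left of the block."""
    top, left = np.asarray(top), np.asarray(left)
    if mode == "vertical":        # copy the row of samples above down each column
        return np.tile(top, (4, 1))
    if mode == "horizontal":      # copy the column of samples on the left across
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == "dc":              # flat block at the rounded mean of the neighbours
        dc = (top.sum() + left.sum() + 4) // 8
        return np.full((4, 4), dc)
    raise ValueError(mode)

pred = intra_4x4_predict([10, 20, 30, 40], [10, 10, 10, 10], "dc")
# The residual (actual block minus pred) is what goes through the transform.
```

The encoder tries the available modes and keeps the one whose residual costs the fewest bits.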

The Motion Estimation block also contains a new feature, in that the precision of the estimation can vary within the image, as illustrated by Figure 8.


Figure 8 - Example of motion estimation in H.264, with a variable precision.
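Sub-pixel precision relies on interpolating samples between full-pixel positions. H.264 derives half-sample luma values with a 6-tap filter of coefficients (1, -5, 20, 20, -5, 1)/32; the sketch below applies it at one position, ignoring the clipping to the valid sample range that the standard also performs:

```python
def half_pel(samples):
    """Interpolate one half-sample luma position from six full-pel neighbours
    along a row (or column), using the H.264 6-tap filter (1,-5,20,20,-5,1)/32.
    Sketch only: clipping to the valid sample range is omitted."""
    a, b, c, d, e, f = samples
    return (a - 5 * b + 20 * c + 20 * d - 5 * e + f + 16) >> 5

# In a flat area the interpolated value equals the surrounding samples:
print(half_pel([100, 100, 100, 100, 100, 100]))  # -> 100
```

Quarter-sample positions are then obtained by averaging neighbouring full- and half-sample values.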

Finally comes the entropy coding, which also introduces several original ideas. Two main modes have been defined. The first and more classical one is the CAVLC mode (Context-based Adaptive Variable-Length Coding). It relies on traditional variable-length codes (VLC) such as the Exponential-Golomb code presented in Table 2. The second mode relies on arithmetic coding, as does the JPEG 2000 standard, and is called the CABAC mode (Context-based Adaptive Binary Arithmetic Coding). In practice, CABAC was observed to offer better compression performance than CAVLC, but at a greater complexity. As a consequence, CABAC is only part of the Main profile, whereas CAVLC is used in all profiles, including the eXtended profile dedicated to streaming over various channels, in particular wireless ones.

  Code number    Codeword
  0              1
  1              0 1 0
  2              0 1 1
  3              0 0 1 0 0
  4              0 0 1 0 1
  5              0 0 1 1 0
  6              0 0 1 1 1
  7              0 0 0 1 0 0 0
  8              0 0 0 1 0 0 1
  9              0 0 0 1 0 1 0
  10             0 0 0 1 0 1 1
  11             0 0 0 1 1 0 0

  (each codeword consists of prefix bits, a 1, then suffix bits)


Table 2 - Example of entropy code: Exponential Golomb (CAVLC mode).
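The regular structure of Table 2 makes unsigned Exp-Golomb codewords easy to generate: for code number n, write n+1 in binary and prepend one zero per bit beyond the first. A minimal sketch:

```python
def exp_golomb(code_num):
    """Unsigned Exp-Golomb codeword for code_num, as in Table 2:
    a prefix of zeros, a '1', then the suffix bits of code_num + 1."""
    bits = bin(code_num + 1)[2:]           # binary representation of code_num + 1
    return "0" * (len(bits) - 1) + bits    # prefix zeros + leading '1' + suffix

for n in range(5):
    print(n, exp_golomb(n))
# 0 1
# 1 010
# 2 011
# 3 00100
# 4 00101
```

The prefix length tells the decoder how many suffix bits follow, so the code is decodable without any stored table.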

The resulting syntax of the H.264 coding standard is then obtained by following the scheme presented in Figure 9 (here in the case of an Intra frame), which shows all the operations carried out by the standard to create the compressed bitstream.


Figure 9 - Generating H.264 codestream.

3. Dependencies inside the H.264 Video Coding Format


The H.264 standard introduces a new notion compared to MPEG-4: I frames are fully encoded in Intra mode, but later frames may take as reference a frame appearing before the I frame. "Classical" I frames, in the sense of fully intra-coded frames that moreover correspond to a complete flush of the reference buffers, are called IDR frames.

Another difference introduced by the H.264 standard is the so-called Multi-reference mode, illustrated by Figure 10. Unlike previous video coding standards, which used a simple reference mode (a P prediction could only be made with respect to a given preceding picture, and a B prediction with respect to one P or I frame in each direction), H.264 allows up to 32 different frames to be used as references for each P-slice, and up to 64 for each B-slice. The pictures that have been encoded or decoded and are available for reference are stored in the Decoded Picture Buffer (DPB). In practice, they are marked either as short-term reference pictures, indexed according to PicOrderCount, or as long-term reference pictures, indexed according to LongTermPicNum. When the DPB is full, only the oldest short-term picture is removed; long-term pictures are removed only by an explicit command in the bitstream.


Figure 10 - Multi-reference principle in H.264 standard.
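The sliding-window behaviour of the DPB described above can be sketched as follows; the class and method names are illustrative (the standard expresses these operations as memory-management control commands in the bitstream, not as an API):

```python
from collections import deque

class DecodedPictureBuffer:
    """Sketch of H.264 reference picture management: short-term pictures live
    in a sliding window (the oldest is evicted when the buffer is full), while
    long-term pictures stay until removed by an explicit bitstream command."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.short_term = deque()    # oldest picture on the left
        self.long_term = {}          # LongTermPicNum -> picture

    def add_short_term(self, pic):
        if len(self.short_term) + len(self.long_term) >= self.capacity:
            self.short_term.popleft()         # sliding window: drop the oldest
        self.short_term.append(pic)

    def mark_long_term(self, pic, long_term_num):
        if pic in self.short_term:
            self.short_term.remove(pic)
        self.long_term[long_term_num] = pic   # protected from the sliding window

    def remove_long_term(self, long_term_num):
        self.long_term.pop(long_term_num, None)  # explicit bitstream command

dpb = DecodedPictureBuffer(capacity=4)
for frame in range(6):
    dpb.add_short_term(frame)
# Frames 0 and 1 have been evicted; frames 2..5 remain usable as references.
```

Marking a picture long-term thus lets an encoder keep, say, a scene's background frame available as a reference long after the sliding window would have discarded it.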