Studies on H.264 began within the Video Coding Experts Group (VCEG) of ITU-T in 1999. The objectives of this standard were:
- to provide efficient compression (a reduction of about 50% in average bitrate at equivalent visual quality when compared to the other existing standards).

H.264's main target applications are:
- duplex real-time conversational services (e.g. videotelephony) over wired or wireless (such as UMTS) networks (bitrate below 1 Mb/s and low latency).

The first test model (TML-1) was delivered in August 1999. In December 2001, VCEG and MPEG decided to create a partnership, the "Joint Video Team" (JVT), to establish a common standard, labelled H.264 or MPEG-4 AVC (Advanced Video Coding). First drafts were proposed to ITU and ISO in 2002, and the final version of the ITU document (JVT-G050) was established in March 2003.
H.264 specifies only the video coding aspects, while transport issues are handled by the MPEG-4 Systems specifications. Still, as illustrated in Figure 1, H.264 includes, above its coding layer, a flexible adaptation layer that allows it to be compatible with transport technologies of both the wireless and wired worlds:
- for telephone networks, through H.324 (circuit mode) or H.320 (fixed networks).

This makes H.264 a standard compatible with both mobile and fixed solutions, and hence a way to support the unification of those two worlds that is already under way.
It should be noted that, while now part of the MPEG-4 standard, H.264 is not directly compatible with the other MPEG-4 video parts, as the compression tools differ greatly. While not completely different from the other existing video standards, the video coding layer of H.264 is characterised by the following new key aspects:
- better motion compensation.

This leads to a noticeable bitrate reduction at equal quality when compared to the other standards, as illustrated in Table 1.
Table 1. Average bitrate gain of H.264/AVC over other standards (mix CIF/QCIF: 4 QCIF sequences at 10 and 15 fr/s).

| Average bitrate gain of H.264/AVC compared to: | Gain |
|---|---|
| MPEG-4 ASP (Advanced Simple Profile) | 28% |
| H.263 CHC (Conversational High Compression) | 32% |
| MPEG-4 SP (Simple Profile) | 34% |
| H.263 Baseline | 45% |
It should however be noted that this gain comes at a cost: the H.264 Baseline encoder (resp. decoder) is estimated to be four to five times (resp. two to three times) more complex than an MPEG-2 encoder (resp. decoder).
H.264 defines three profiles:
- Baseline, which is particularly adapted to videoconferencing, video over IP and mobility applications. It implements only I and P frames, and a few error-protection tools.

It should be noted that the Extended profile contains the Baseline one, which is not the case for the Main profile, as illustrated by Figure 2, where the main tools implemented by each profile are indicated.
In terms of evolution and insertion into the market, it seems that for now most implementations use the Baseline profile. Still considered as under development little more than a year ago, H.264 has since appeared, in particular since IBC 2003 in Amsterdam (September 2003), as a technology to be reckoned with. Demonstrations were given last year by various manufacturers and operators interested in the standard: BT Research labs presented an H.264 decoder on a Nokia 7650 mobile phone, decoding QCIF at about 20 kbit/s and 14 fr/s; LSI Logic presented a Main@Level 3 encoder combining FPGAs and Pentium processors; and a real-time Baseline encoder running on a single Pentium IV PC was proposed by Vanguard Software Solutions.
It is undeniable that H.264 is now riding the developing wave of video over wireless and streaming over the Internet, and that it will soon be a serious alternative to existing standards, provided patent issues do not slow its rise as they did for MPEG-4.
A drawback of H.264 is that it does not include scalability, except for the temporal scalability (illustrated in Figure 3) offered by the use of bi-directional (B) frames in the eXtended or Main profiles. This could however change, as the MPEG-21 scalable video coding (SVC) activity is considering, among other solutions, a scalable extension to H.264/AVC.
The H.264 standard relies on the global coding scheme presented in Figure 4.
Let us focus on a few elements of this coding scheme, in particular those which are novelties compared to traditional predictive coding standards. First comes the partition into macroblocks and then blocks: each image is subdivided into 16x16-pixel macroblocks, themselves divided into 4x4-pixel blocks, as illustrated by Figure 5. Next comes the transform used, which is no longer a DCT but an integer transform called H, an approximation of the DCT with the advantage of being invertible without loss. This transform is always applied to 4x4 blocks and is defined by its matrix H:

    H = | 1  1  1  1 |
        | 2  1 -1 -2 |
        | 1 -1 -1  1 |
        | 1 -2  2 -1 |
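As a minimal sketch of this idea, the following Python snippet applies the 4x4 core transform Y = H X H^T using integer arithmetic only, and then inverts it exactly. The inverse here uses the identity H H^T = diag(4, 10, 4, 10); in the real codec these scale factors are folded into the (de)quantization tables, so the decoder also stays integer-only. The function names are illustrative, not from the standard.

```python
from fractions import Fraction

# The 4x4 integer core transform matrix H of H.264 (a DCT approximation)
H = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(A, B):
    """Plain 4x4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def forward(X):
    """Forward core transform Y = H X H^T, integer arithmetic only."""
    return matmul(matmul(H, X), transpose(H))

def inverse(Y):
    """Exact inverse: since H H^T = diag(4,10,4,10), H^-1 = H^T D^-1,
    so X = H^T (D^-1 Y D^-1) H. Exact rationals show losslessness."""
    s = [Fraction(1, 4), Fraction(1, 10), Fraction(1, 4), Fraction(1, 10)]
    Ys = [[Fraction(Y[i][j]) * s[i] * s[j] for j in range(4)] for i in range(4)]
    return matmul(matmul(transpose(H), Ys), H)

X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 8, 7, 6], [5, 4, 3, 2]]  # toy residual block
Y = forward(X)
print(inverse(Y) == [[Fraction(v) for v in row] for row in X])  # lossless round trip
```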
Following the H transform comes a uniform scalar quantization. A deblocking filter was also introduced to combat the blocking artefacts due to the transform and quantization, as illustrated by Figure 6.
Another very important innovation introduced by the JVT team is that even intra blocks use prediction, called Intra prediction, which relies on the adjacent blocks of the Intra image. Two prediction modes are defined, with different precision levels: the 16x16 mode, consequently used at lower bitrates and in areas of the image with fewer details, and the 4x4 mode, used more in detailed areas and at higher bitrates. The prediction directions proposed by each mode are presented in Figure 7.
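To make the mechanism concrete, here is a sketch of two of the 4x4 Intra prediction modes, using only previously reconstructed neighbouring pixels; the vertical and DC modes shown here do exist in the standard, but the function names and sample values are illustrative.

```python
def intra4x4_vertical(top):
    """Vertical mode: each column repeats the reconstructed pixel above it."""
    return [list(top) for _ in range(4)]

def intra4x4_dc(top, left):
    """DC mode: all 16 predicted samples equal the rounded mean of the
    4 top and 4 left reconstructed neighbours."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]

top  = [100, 102, 104, 106]   # reconstructed row above the block (toy values)
left = [ 98,  99, 101, 103]   # reconstructed column to its left
print(intra4x4_dc(top, left)[0][0])        # 102 (rounded mean of neighbours)
print(intra4x4_vertical(top)[3])           # [100, 102, 104, 106]
```

The encoder then transforms and codes only the residual between the actual block and this prediction, which is why Intra prediction saves bits compared to coding raw pixels.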
The Motion Estimation block also contains a new feature, in that the precision of the estimation can vary within the image, as illustrated by Figure 8.
Finally comes the entropy coding, which also introduces several original ideas. Two main modes have been defined. The first and more classical one is the CAVLC mode (Context-based Adaptive Variable-Length Coding). It relies on traditional variable-length codes (VLC) such as the Exponential-Golomb code presented in Table 2. The second mode relies on arithmetic coding, as in the JPEG 2000 standard, and is called the CABAC mode (Context-based Adaptive Binary Arithmetic Coding). In practice, CABAC was observed to offer better compression performance than CAVLC, but at a greater complexity. As a consequence, CABAC is only part of the Main profile, whereas CAVLC is used in all profiles, in particular the eXtended profile, which is adapted to streaming over various channels, notably wireless ones.
Table 2. Exponential-Golomb codewords (structure: prefix bits, then suffix bits).

| Code number | Codeword |
|---|---|
| 0 | 1 |
| 1 | 0 1 0 |
| 2 | 0 1 1 |
| 3 | 0 0 1 0 0 |
| 4 | 0 0 1 0 1 |
| 5 | 0 0 1 1 0 |
| 6 | 0 0 1 1 1 |
| 7 | 0 0 0 1 0 0 0 |
| 8 | 0 0 0 1 0 0 1 |
| 9 | 0 0 0 1 0 1 0 |
| 10 | 0 0 0 1 0 1 1 |
| 11 | 0 0 0 1 1 0 0 |
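The regular prefix/suffix structure of these codewords makes the encoder a one-liner. The following sketch generates the unsigned Exp-Golomb codeword for any code number; the function name is illustrative.

```python
def exp_golomb(code_num):
    """Unsigned Exp-Golomb codeword: M zero prefix bits, then the
    (M+1)-bit binary representation of code_num + 1, where
    M = floor(log2(code_num + 1))."""
    x = code_num + 1
    m = x.bit_length() - 1          # number of prefix zeros
    return '0' * m + bin(x)[2:]     # prefix + '1' + suffix bits

# Reproduce the first rows of Table 2:
for n in range(4):
    print(n, exp_golomb(n))   # 0 1 / 1 010 / 2 011 / 3 00100
```

Note that no codeword is a prefix of another, so the decoder can simply count leading zeros to find the codeword length.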
The resulting syntax of the H.264 coding standard is then obtained by following the scheme presented in Figure 9 for the case of an Intra frame, which shows all the operations carried out by the standard to create the compressed bitstream.
The H.264 standard introduces a new notion compared to MPEG-4: I frames are frames fully encoded in Intra mode, but later frames may still take as reference a frame appearing before the I frame. "Classical" I frames, i.e. fully intra-coded frames that moreover correspond to a complete flush of the reference buffers, are called IDR frames.
Another difference introduced by the H.264 standard is the so-called multi-reference mode, illustrated by Figure 10. Unlike previous video coding standards, which used a simple reference mode (a P prediction could only be made with respect to a given preceding picture, and a B prediction with respect to one P or I frame in each direction), H.264 allows up to 32 different frames to be used as reference for each P-slice and up to 64 different frames for each B-slice. The pictures that are encoded or decoded and available for reference are stored in the Decoded Picture Buffer (DPB). In practice, they are marked either as short-term reference pictures, indexed according to PicOrderCount, or as long-term reference pictures, indexed according to LongTermPicNum. When the DPB is "full", only the oldest short-term picture is removed. Long-term pictures are not removed except by an explicit command in the bitstream.
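The short-term/long-term distinction can be sketched as a small buffer class, assuming a sliding-window removal policy for short-term pictures; the class and method names here are hypothetical, not taken from the standard.

```python
from collections import deque

class DecodedPictureBuffer:
    """Toy sketch of DPB reference management: short-term references are
    recycled sliding-window style when the buffer is full; long-term
    references stay until an explicit removal command."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.short_term = deque()   # oldest short-term picture on the left
        self.long_term = {}         # LongTermPicNum -> picture

    def add_short_term(self, pic):
        if (len(self.short_term) + len(self.long_term) >= self.capacity
                and self.short_term):
            self.short_term.popleft()        # drop only the oldest short-term
        self.short_term.append(pic)

    def mark_long_term(self, pic, long_term_num):
        self.long_term[long_term_num] = pic  # kept until explicitly removed

    def remove_long_term(self, long_term_num):
        self.long_term.pop(long_term_num, None)

dpb = DecodedPictureBuffer(capacity=4)
for n in range(6):
    dpb.add_short_term(f"frame{n}")
print(list(dpb.short_term))   # ['frame2', 'frame3', 'frame4', 'frame5']
```

Long-term marking is what lets an encoder pin, say, a background frame in the buffer while short-term references cycle past it.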