MPEG Compression

How MPEG Video Compression Works

A single frame of broadcast-quality video comprises 640 by 480 pixels, which, when multiplied by each pixel's 24 bits of color information, brings the total to 921 KB. A single-speed CD-ROM can only read data at a rate of 150 KB per second. That means that if you want to achieve 30 frames per second, you need to squeeze each 921 KB frame down to 5 KB (a compression ratio of almost 200:1). Since you also need room for other data such as text, voice and MIDI, the video data usually needs to be compressed by more than 200:1.
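
The arithmetic is simple enough to check directly; here is a quick sketch in Python, using only the frame size and data rate quoted above (variable names are illustrative):

    frame_bits = 640 * 480 * 24              # bits per uncompressed frame
    frame_kb = frame_bits / 8 / 1000         # roughly 921 KB per frame
    cdrom_rate_kb = 150                      # single-speed CD-ROM, KB per second
    frames_per_second = 30

    budget_per_frame_kb = cdrom_rate_kb / frames_per_second    # 5 KB per frame
    ratio = frame_kb / budget_per_frame_kb                     # about 184:1

    print(f"uncompressed frame: {frame_kb:.0f} KB")
    print(f"budget per frame:   {budget_per_frame_kb:.0f} KB")
    print(f"required ratio:     {ratio:.0f}:1")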

Video compression is composed of ten basic processes, six of which result in data reduction. The first four, filtering, color conversion, digitizing and scaling, are essentially preprocessing steps. Though they are not considered compression, they can have a dramatic effect on data reduction. The processes that directly result in compression are data transforms, quantization, compaction encoding, and interframe compression using motion compensation and predictive coding.

In MPEG, compression results from intraframe compression, in which redundancies among pixels in the same video frame are exploited to reduce the amount of data, and interframe compression, in which similarities between pixels in adjacent frames are exploited to compress the data. Most of MPEG's compression is due to interframe compression.

Filtering

Images which lack sharp edges are easier to compress than images which have sharp edges. Thus a preprocessing step in MPEG is to filter the image to reduce sharp edges (i.e. high frequencies). Filtering works by averaging neighboring pixels, which lowers the high-frequency content and therefore the data rate. The number of pixels used in the average is called the number of taps. MPEG recommends using a seven-tap filter.
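
As an illustration, here is a minimal sketch of a seven-tap filter in Python. The equal-weight kernel is an assumption made for simplicity; a real filter design would use carefully chosen weights.

    import numpy as np

    def seven_tap_filter(line, taps=7):
        """Low-pass filter one scan line by averaging each pixel with its
        neighbors; a uniform kernel stands in for a real filter design."""
        kernel = np.ones(taps) / taps               # equal weights that sum to 1
        return np.convolve(line, kernel, mode="same")

    # A sharp edge (0 -> 255) becomes a gradual ramp after filtering,
    # which makes the line easier to compress.
    line = np.array([0, 0, 0, 0, 255, 255, 255, 255], dtype=float)
    print(seven_tap_filter(line).round(1))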

Color-Space Conversion

Human eyes detect less detail in color changes than in brightness (black and white) changes. Compression schemes take advantage of this lack of sensitivity to save significant space. Human eyes detect three primary colors: red, green and blue. Different colors in any image are combinations of differing amounts of red, green and blue. This is illustrated by forming a cube where each of the three colors lies along one axis of the color cube. For example, yellow is made up of red and green with no blue. The corner of the cube where R=G=B=0 is black and the diagonally opposite corner is white.

All points along the black-to-white diagonal of the cube are shades of gray or luminance, called Y. The eye is very sensitive to changes in luminance but not so sensitive to changes in color. Thus, rather than storing video information in terms of its red, green and blue components, two chrominance values, called U and V, and the luminance value, Y, are used (this is called the YUV representation). The YUV representation is simply a different set of coordinates used to represent the same point in color space. The U vector points approximately in the blue direction while the V vector points approximately in the red direction. Any color can be converted from its RGB representation to its YUV values using a simple matrix transformation. The advantage of this approach is that the Y value can be digitized at full resolution while U and V can be digitized at lower resolutions without causing the eye to notice any reduction in information or detail.
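
A minimal sketch of that matrix transformation, using common BT.601-style coefficients; the text does not specify which constants MPEG uses, so treat these values (and the helper name) as illustrative:

    import numpy as np

    # Illustrative RGB -> YUV matrix (BT.601-style constants).
    RGB_TO_YUV = np.array([
        [ 0.299,  0.587,  0.114],   # Y: luminance
        [-0.147, -0.289,  0.436],   # U: roughly the blue-minus-luminance axis
        [ 0.615, -0.515, -0.100],   # V: roughly the red-minus-luminance axis
    ])

    def rgb_to_yuv(rgb):
        """Convert one R,G,B triple (each 0..1) to Y,U,V with a matrix multiply."""
        return RGB_TO_YUV @ np.asarray(rgb, dtype=float)

    print(rgb_to_yuv([1.0, 1.0, 0.0]))   # yellow: bright Y, negative U, small V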

This is a good point to define the three types of video signals used in PC video. When video is divided into its RGB components, it's called RGB video. When Y, U and V get separate cables, it's called component video, and when U and V are combined but Y is separate (two cables), it's called S-video.

Digitization

When video is stored in YUV values, not as much detail is needed for the U and V chrominance values as for the Y luminance value. Various sampling schemes are used by different compression methods and video formats. In studio-quality video, the U and V components are sampled twice for every four Y samples (4:2:2 YUV), so there are an average of 16 bits per pixel. In MPEG (and DVI), only one U and one V sample are taken for each 2-by-2 square of luminance pixels (4:2:0 YUV). Thus in MPEG there are an average of 12 bits per pixel. In Indeo, only one U and one V sample are taken for each 4-by-4 array of luminance pixels. Intel calls this YUV9 because there are an average of 9 bits per pixel. For MPEG, this 4:2:0 YUV sampling scheme results in 2:1 compression.
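
The bits-per-pixel figures follow directly from the sampling ratios; a small sketch of the arithmetic (the helper name is illustrative):

    def bits_per_pixel(luma_pixels_per_chroma_pair, bits_per_sample=8):
        """Average bits per pixel when every pixel gets one Y sample and each
        group of luma_pixels_per_chroma_pair pixels shares one U and one V."""
        chroma_bits = 2 * bits_per_sample / luma_pixels_per_chroma_pair
        return bits_per_sample + chroma_bits

    print(bits_per_pixel(2))    # 4:2:2 studio video           -> 16.0
    print(bits_per_pixel(4))    # MPEG, one U,V per 2x2 block  -> 12.0
    print(bits_per_pixel(16))   # Indeo YUV9, one U,V per 4x4  ->  9.0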

Scaling

The largest reduction in data occurs by scaling (subsampling) the image when it is not displayed in a full 640 X 480 pixel window or when a reduction in spatial resolution is tolerable. High-quality digitizers average, or filter, neighboring pixels and lines with multitap filters. The inverse of scaling is zooming the picture on playback. When crudely done, this simply doubles or quadruples pixels and lines, producing strong pixelization and jaggies. In contrast, smart zooming interpolates the missing pixels by averaging their surroundings.
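
A toy sketch contrasting crude pixel doubling with interpolating ("smart") zooming, written for a one-dimensional line of pixels to keep the idea visible (function names are illustrative):

    import numpy as np

    def crude_zoom(line):
        """Double every pixel; fast, but produces blockiness and jaggies."""
        return np.repeat(line, 2)

    def smart_zoom(line):
        """Insert each missing pixel as the average of its two neighbors."""
        line = np.asarray(line, dtype=float)
        out = np.empty(line.size * 2 - 1)
        out[0::2] = line
        out[1::2] = (line[:-1] + line[1:]) / 2
        return out

    line = np.array([0, 100, 200])
    print(crude_zoom(line))   # [  0   0 100 100 200 200]
    print(smart_zoom(line))   # [  0.  50. 100. 150. 200.]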

Transforms

The first step in pure compression is a data transform which converts the ordinary two-dimensional spatial representation of a picture (or video frame) into another set of dimensions (such as frequency) in which some of the data can be selectively eliminated without severely degrading the quality of the picture. In the discrete cosine transform (DCT) used in JPEG and MPEG, a Fourier-type transform converts the image (or video frame) to frequency space. Because most natural images contain very little high frequency information, most of the picture information is concentrated in the low frequency coefficients of the transformed data. Image transforms by themselves don't lead to compression because they are inherently lossless, that is, the inverse transform restores the original image. The value in using a transform in compression is obtained in the next step, quantization, which throws away most of the high frequency information, leaving a slightly "softer" picture. The objective is to throw away just enough information so that the eye doesn't notice the loss of the higher frequencies which contain the sharper details of the image.

The DCT works on 8 X 8 blocks of pixels, which results in 64 frequency coefficients that are arranged in an 8 X 8 square in frequency space as shown in the figure. The lowest frequency term (which is essentially the average value of the block) is at the upper left. Horizontal frequency increases from left to right, while vertical frequency increases from top to bottom. For natural images, most of the energy in the block lies in the low frequency coefficients in the upper left-hand corner.
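
A sketch of the 8 X 8 transform using SciPy's DCT routine; applying the type-II DCT along both axes approximates what JPEG and MPEG do, though real codecs use fixed-point arithmetic and their own scaling:

    import numpy as np
    from scipy.fft import dct

    def dct_8x8(block):
        """2-D type-II DCT of an 8 X 8 block: transform the rows, then the columns."""
        return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

    # A smooth gradient block: nearly all of the energy ends up in the
    # low-frequency coefficients in the upper-left corner.
    block = np.tile(np.arange(8, dtype=float) * 16, (8, 1))
    print(dct_8x8(block).round(1))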

The Wavelet Transform is another type of transform used in video compression. The Wavelet Transform converts an image to multiple spatial images, one of which contains the high frequency horizontal information (horizontal edges), one of which contains the high frequency vertical information (vertical edges), one of which contains the high frequency diagonal information (diagonal edges) and other images which contain successively lower frequency information in the horizontal, vertical and diagonal directions. The Wavelet Transform keeps image information more localized, enabling more selective data reduction at the quantization step.
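
A single-level Haar decomposition is the simplest concrete example; production wavelet codecs use longer filters, but the four sub-images are the same ones described above:

    import numpy as np

    def haar_level(img):
        """One level of a 2-D Haar transform: a low-pass approximation plus
        horizontal-, vertical- and diagonal-edge detail images."""
        img = np.asarray(img, dtype=float)
        a, b = img[0::2, 0::2], img[0::2, 1::2]   # top-left, top-right of each 2x2
        c, d = img[1::2, 0::2], img[1::2, 1::2]   # bottom-left, bottom-right
        low   = (a + b + c + d) / 4               # smooth, half-resolution image
        horiz = (a + b - c - d) / 4               # horizontal edges
        vert  = (a - b + c - d) / 4               # vertical edges
        diag  = (a - b - c + d) / 4               # diagonal edges
        return low, horiz, vert, diag

    img = np.outer(np.arange(8), np.ones(8)) * 32     # brightness ramps downward
    low, horiz, vert, diag = haar_level(img)
    print(horiz)    # nonzero: the ramp shows up as horizontal-edge detail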

Quantization

The heart of lossy compression is the quantization step, in which the transformed coefficients (which can be considered continuous floating-point numbers) are converted into representations that use fewer bits and are therefore represented with larger "quantum" steps.

In MPEG and JPEG, the low frequency DCT coefficients are quantized using approximately four bits, meaning they have 16 different levels. The higher frequency coefficients are quantized with fewer bits, usually 2, 1, or 0 bits. This means that many of the high frequency coefficients are converted to zero, which causes them to be eliminated. MPEG and JPEG enable one to define the quantization table (Q table), i.e. how many bits are used for each of the coefficients in the 8 X 8 block. If a codec chooses its own Q table based on an analysis of the image, it's called adaptive compression.
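
A sketch of the quantization step with a made-up Q table whose step size grows with frequency; a coarser step is equivalent to spending fewer bits on that coefficient. The real MPEG and JPEG tables differ, but the effect is the same: coarse steps drive most high-frequency coefficients to zero.

    import numpy as np

    # Hypothetical Q table: fine steps at the low-frequency upper-left corner,
    # coarse steps toward the high frequencies.
    i, j = np.indices((8, 8))
    q_table = 16 + 8 * (i + j)

    def quantize(coeffs, q_table):
        """Divide each coefficient by its step size and round to an integer level."""
        return np.round(coeffs / q_table).astype(int)

    def dequantize(levels, q_table):
        """Approximate reconstruction: multiply the integer levels back out."""
        return levels * q_table

    # A typical natural-image block: large low-frequency coefficients that
    # fall off quickly toward the high frequencies.
    coeffs = 800.0 / (1 + i + j) ** 2
    print(quantize(coeffs, q_table))   # most high-frequency entries become zero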

Compaction Encoding

The final step in intraframe compression is a lossless process of data compaction similar to that used to compress data files. Codecs use three types of compaction schemes. The first is run-length encoding, which replaces consecutive identical digits (usually zeros) with the number of consecutive digits and their value. The second is Huffman coding, where values are replaced by codes based on their frequency of occurrence; frequently occurring values are given shorter codes, which results in compression. The third type of compaction is arithmetic coding, which operates on somewhat the same principle as Huffman coding but is more efficient.
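
A minimal run-length encoder of the kind described above (Huffman and arithmetic coding are more involved and are left out of this sketch):

    def run_length_encode(values):
        """Replace each run of identical values with a (count, value) pair."""
        encoded = []
        run_value, run_length = values[0], 1
        for v in values[1:]:
            if v == run_value:
                run_length += 1
            else:
                encoded.append((run_length, run_value))
                run_value, run_length = v, 1
        encoded.append((run_length, run_value))
        return encoded

    # Quantized DCT coefficients end in long runs of zeros, which is exactly
    # what run-length encoding compacts well.
    print(run_length_encode([50, 8, 3, 0, 0, 0, 0, 0, 0, 0]))
    # [(1, 50), (1, 8), (1, 3), (7, 0)]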

Interframe Compression

The greatest amount of data redundancy in video is not among the neighboring pixels in a single frame; it's in the vast number of similar pixels in successive frames that do not change or change very slowly. By predicting that a pixel will remain the same, and then encoding the difference, you end up with a string of zeroes or very small numbers suitable for efficient run-length encoding.
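
A sketch of that idea: predict "no change" for every pixel and keep only the differences, which for static regions are all zero (the helper name is illustrative):

    import numpy as np

    def frame_difference(previous, current):
        """Predict that nothing changes and keep only the error signal."""
        return current - previous

    previous = np.array([[10, 10, 10, 10],
                         [10, 50, 50, 10],
                         [10, 10, 10, 10]])
    current = previous.copy()
    current[1, 2] = 60                      # a single pixel changes

    print(frame_difference(previous, current))   # mostly zeros: ideal for run-length encoding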

MPEG looks at a frame that has already been compressed using intraframe techniques (called an I-frame) and predicts a later frame or picture, called a P-frame, which can be the next frame or a frame two or four frames later. Interpolated frames, called B-frames, fill in the skipped frames.

MPEG breaks the I-picture into 16 X 16 pixel blocks (macroblocks) and assumes each block is the same in the predicted P-frame, say four frames later. If the pixels in the video are moving, then the block will not be the same, and MPEG subtracts the actual picture block from the prediction to produce a new picture made up of the set of differences, called an error signal. This "differences" picture is zero wherever the picture doesn't change. MPEG then applies intraframe JPEG compression to the differences picture in 8 X 8 pixel blocks. If the picture has changed considerably, MPEG invokes more powerful motion estimation techniques to see if a macroblock has moved just slightly. If motion compensation fails, MPEG starts over and recodes the macroblock with intraframe coding. If a P-picture is completely different from the preceding I-picture (so that it's unpredictable), it becomes an entirely new I-picture. Normally, P-pictures are used to predict other P-pictures several times before MPEG encodes another I-picture.

When pixels change, it's often due to predictable camera or subject motion, which causes the macroblocks to appear in the new image in a slightly different position. MPEG uses a complex trial-and-error procedure to search for macroblocks that have moved. MPEG's motion estimation procedure calculates a motion vector (a vertical and horizontal pixel shift) which is used to obtain a better prediction for that macroblock in the next frame.
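
A sketch of exhaustive block matching over a small search window, scoring candidates with the sum of absolute differences; real encoders use much cleverer search strategies, but the motion vector they produce plays the same role (the helper name is illustrative):

    import numpy as np

    def find_motion_vector(ref, cur, top, left, size=16, search=8):
        """Find the (dy, dx) shift into the reference frame whose block best
        matches the block at (top, left) in the current frame."""
        block = cur[top:top + size, left:left + size].astype(int)
        best, best_sad = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                    continue
                sad = np.abs(block - ref[y:y + size, x:x + size].astype(int)).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        return best, best_sad

    ref = np.zeros((64, 64), dtype=np.uint8)
    ref[20:36, 20:36] = 255                      # a bright square
    cur = np.roll(ref, (2, 3), axis=(0, 1))      # the square moves 2 down, 3 right
    print(find_motion_vector(ref, cur, 22, 23))  # ((-2, -3), 0): a perfect match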

Bidirectional prediction in MPEG interpolates pictures by predicting them not only from earlier I or P frames but also from later I or P frames. The later frames must be transmitted out of order to make this trick possible. The easiest and fastest way to interpolate a picture is to simply average earlier and later frames. However, to achieve greater compression, MPEG uses bidirectional interpolation. A B-picture contains macroblocks which have been successfully predicted from a preceding I or P picture, from a following P or I picture, from both a preceding and a following picture (by averaging), or from neither, in which case that macroblock is intraframe coded. When a prediction is successful, the result is a small error signal, which is itself coded using intraframe coding.
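
A sketch of the averaging case for a single macroblock, the simplest of the four possibilities listed above (the helper name is illustrative):

    import numpy as np

    def predict_b_macroblock(prev_block, next_block):
        """Bidirectional prediction: average the matching macroblocks from the
        earlier and later reference pictures."""
        return (prev_block.astype(int) + next_block.astype(int)) // 2

    prev_block = np.full((16, 16), 100, dtype=np.uint8)
    next_block = np.full((16, 16), 110, dtype=np.uint8)
    actual     = np.full((16, 16), 106)

    error = actual - predict_b_macroblock(prev_block, next_block)
    print(error[0, 0])    # 1: a small error signal that is cheap to code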

The MPEG standard allows as many as three B-pictures in a row. The number of I-pictures is typically two per second, which means that P-pictures are used to forward-predict two to five following P-pictures before another I-picture is coded.