Oversampling
How to Implement a Proper Video Scaler (Interpolator)
Motion Vector Steering
©1998 John Watkinson. All rights reserved.
DISCLAIMER: This material is copyrighted by its author. The information contained in this document does not necessarily represent the views of Microsoft Corporation.
People seem to think that high definition television needs lots of lines, but it's a myth. Cameras and displays need a lot of lines to overcome aperture effects and to render the raster invisible, but the transmission medium between doesn't. In the early days of television, the capture, transmission, and display formats had to be identical for simplicity, but that's no longer true or desirable.
A 480 line camera can't give 480 lines of resolution, but a 960 line camera with downconversion can. Effectively, the camera is using oversampling. Although oversampling has totally dominated digital audio because of its obvious merits, it is harder to use it in conventional television because of interlace. Interlace puts half the picture data at another time and reduces the performance of spatial resamplers. Once interlace is dispensed with, oversampling becomes an obvious and attractive technology.
Oversampling overcomes practical limits in optical filters. In a CCD camera, the sensor elements sample the image spatially. The sensors are large for maximum light sensitivity and so a serious aperture effect is experienced. Ideally an optical anti-aliasing filter is needed between the lens and the sensor. Unfortunately, it is difficult to make a filter that has a sharp cut-off and it is usually necessary to compromise between visible aliasing and picture softness.
Using oversampling, this compromise is unnecessary. The optical anti-aliasing filter only needs to prevent aliasing at the higher sampling rate. The output of the CCD element is spatially low-pass filtered and decimated to produce a TV signal with the target pixel count. It will contain no spatial aliasing and will not suffer loss at the band edge.
As a CRT display is a sampled device, breaking the picture up into lines, it should ideally be followed by an optical filter. As before, this is not done because in order to eliminate the raster it would intrude into the passband. Oversampling can also be used to render the raster invisible. Once more a form of video standards converter is required, but this now increases the number of input lines using interpolation. The aperture effect of the display filters out the raster, leaving the passband unaffected.
The adoption of progressive scan allows spatial oversampling to be easily implemented in both camera and display. The number of lines needed in the transmission channel between is then quite moderate. Let's now examine in detail the reasons behind this fact.
If we display a plain gray image on a television set, we see the vertical sampling clock and call it a raster. This is an artifact because it wasn't in the original picture. Here we will look at how to reduce or eliminate the visible raster using oversampling. If we do this properly, the picture looks better. It looks sharper and is sharper, even though we haven't changed the bit rate at all.
Analog television uses sampling to pass two-dimensional images down a single wire. The vertical picture axis is sampled into lines. In digital video and in computer graphics, we extend the sampling process into the horizontal axis, converting a two-dimensional image into an array of pixels.
Sampling is a process of temporal or spatial periodic measurement. Audio waveforms are sampled temporally with a clock rate measured in Hertz. Images are sampled spatially at regularly spaced sites whose spacing is defined in various units. In an imager of a fixed size, we may have pixels per millimeter. In a pixel array, the units are pixels per picture width or height, which is how formats such as VGA are defined.
Figure 7a)
Sampling is amplitude modulation.
Sampling is shown in Figure 7a to be a process where the source waveform amplitude-modulates the clock. This amplitude modulation causes sum- and difference-frequencies to be created around the sampling rate and its harmonics. Figure 7b shows that the sum-frequencies produce an upper sideband and the difference frequencies produce a lower sideband. If the bandwidth is increased, as in Figure 7c, the lower sideband may overlap the baseband, causing aliasing.
Figure 7b)
Correct sampling - Filter can remove sidebands
Figure 7c)
Incorrect sampling - Aliasing results
Aliasing is bad news because it replaces the original signal with something at a completely different frequency. Rather than a small irritation, it's a major defect that should be eliminated in modern systems. To prevent aliasing, the source image must be spatially low-pass filtered to half the sampling rate. In other words, in a picture 768 pixels wide, no spatial frequencies beyond 384 cycles per picture width can be allowed to pass the filter. This is the spatial frequency where the baseband and the lower sideband just meet in an ideal system and is sometimes called the Nyquist frequency. To prevent the sidebands emerging from the output of a sampling system, an identical low pass spatial filter, called a reconstruction filter, should be included at the output.
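The folding of the lower sideband into the baseband can be illustrated with a few lines of arithmetic. This is a minimal sketch, not part of the original article: the function name `alias_frequency` is hypothetical, and frequencies are expressed in cycles per picture width to match the 768 pixel example.

```python
def alias_frequency(f, fs):
    """Return the baseband frequency at which an input at f appears after sampling at fs.

    f and fs share the same units, here cycles (or samples) per picture width.
    """
    f = abs(f) % fs                        # sampling cannot tell f from f + k*fs
    return fs - f if f > fs / 2 else f     # the lower sideband folds about fs/2

fs = 768  # pixels per picture width, as in the example above
for f in (100, 384, 500, 700):
    print(f, "->", alias_frequency(f, fs))
# 100 -> 100
# 384 -> 384
# 500 -> 268
# 700 -> 68
```

Frequencies up to 384 cycles pass unchanged, whereas 500 cycles emerges at 268 cycles, a completely different frequency, which is why the filtering has to happen before sampling.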
Figure 7d)
Proper video scaling
Figure 7d shows that the impulse response of an ideal low pass filter is a sinx/x waveform. With the correct filter, the periodic zero crossings in the impulse response coincide with the sites of adjacent pixels. This has the desirable effect that at the location of a given pixel, the impulses due to other pixels are all zero so that only one pixel determines the output waveform at that point. In other words the output waveform must join up the tops of the samples.
In between samples, the filter output waveform is the sum of a large number of impulses that recreates the original band-limited waveform. This is the reconstruction theorem, and if it didn't work, there would be no digital audio or video, and this document wouldn't have a lot to say.
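The reconstruction process can be sketched numerically. The following is an illustrative example only, with hypothetical names (`sinc`, `reconstruct`) and made-up sample values. It also truncates the sum to a handful of samples, whereas the ideal filter has an infinitely long impulse response.

```python
import math

def sinc(x):
    """sin(pi x)/(pi x), with the limiting value 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, t):
    """Evaluate the reconstructed waveform at position t, measured in sample periods.

    Each sample contributes one sinx/x impulse centred on its own site.
    """
    return sum(s * sinc(t - n) for n, s in enumerate(samples))

samples = [0.0, 0.5, 1.0, 0.25, -0.5, 0.0]   # arbitrary example values

# At a sample site every other impulse is at a zero crossing,
# so the waveform passes through the top of the sample.
at_sites = [reconstruct(samples, n) for n in range(len(samples))]

# Between sites the value is the sum of contributions from many impulses.
between = reconstruct(samples, 2.5)
```

Here `at_sites` matches `samples` to within floating-point rounding, while `between` depends on all six samples.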
Figure 7e)
Rigorously correct imaging system. Don't confuse this with ordinary television
Figure 7e shows a formally correct sampled image portrayal system. It has an optical spatial low-pass filter prior to the image sampler, a sample transmission system, and an optical spatial low-pass filter after the display. Such a system would display natural images because the input filter would prevent spatial aliasing and the output spatial filter would remove the visible raster. There is nothing in sampling theory that says we can't display a sharp artifact-free picture with no visible raster. The corollary is that if we can see the raster on a television screen, we must be doing something wrong.
So why is the raster visible in practice? Basically, there are three problems: the availability of filters, the aperture effect, and the fact that sampling theory was not understood when television systems were first designed.
Figure 7f)
Response of ideal spatial filter - dream on
Figure 7f shows the ideal spatial frequency response of an optical filter having a uniform passband followed by a sharp transition to a heavily attenuated stopband. Unfortunately the best that can be done in practice is a very gentle slope shown in Figure 7g. This is hardly surprising because a filter with the response of Figure 7f needs the sinx/x impulse response of Figure 7d, which has negative excursions and so cannot be realized optically. In the electrical domain there is no difficulty having negative voltages, but negative light is problematic. Consequently all practical optical filter impulse responses have to be unipolar, and this badly limits the steepness of the filter.
Figure 7g)
Response of practical spatial filter. Practice doesn't reach theory. With oversampling it can.
Using a non-ideal or gradual filter like this would prevent aliasing and render the raster invisible, but it would also remove so much high frequency signal that the picture would look awfully soft. With the restricted technology of the day, early television system designers opted to avoid the picture softening, but had to accept the inevitable aliasing and visible raster.
However, even if no additional filtering is used at all, the resolution of the picture will still be impaired because there is inadvertent filtering in all imaging systems due to aperture effect.
Figure 7h)
Point sampling after Anti-Alias filter gives best resolution
Sampling theory assumes ideal point sampling as shown in Figure 7h because this gives the flattest frequency response and the sharpest picture. Unfortunately, a point sample in a camera doesn't gather much light, and so the signal to noise ratio will be inadequate. Practical CCD cameras gather light over the largest possible area as shown in Figure 7i. This causes a near 100% aperture effect and the frequency response rolls off as shown.
Figure 7i)
Practical CCD sensor uses maximum area to reduce noise
In a CRT, point sampling isn't implemented either. The electron beam isn't infinitely small; it's a Gaussian intensity function, whose frequency response is shown in Figure 7j. Putting the camera of Figure 7i in series with the display of Figure 7j is far from ideal. The response in Figure 7k is the sum of the losses due to two aperture effects.
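The size of these losses can be estimated with a short calculation. This is a rough sketch under stated assumptions rather than a model of any particular camera or tube: the sensor is treated as a rectangular aperture, whose response is a sinx/x curve, and the spot width `sigma` of 0.4 line pitches is an assumed value chosen only for illustration. The function names are hypothetical.

```python
import math

def aperture_response(f_over_fs, aperture=1.0):
    """Gain of a rectangular aperture at frequency f relative to sampling rate fs.

    aperture is the sensor width as a fraction of the sample pitch (1.0 = 100%).
    """
    x = math.pi * aperture * f_over_fs
    return 1.0 if x == 0 else math.sin(x) / x

def gaussian_spot_response(f_over_fs, sigma=0.4):
    """Gain of a Gaussian spot; sigma (in line pitches) is an assumed value."""
    return math.exp(-2.0 * (math.pi * sigma * f_over_fs) ** 2)

def to_db(gain):
    return 20.0 * math.log10(gain)

NYQUIST = 0.5  # half the sampling rate

camera_db = to_db(aperture_response(NYQUIST))         # 100% aperture at the band edge
display_db = to_db(gaussian_spot_response(NYQUIST))   # depends on the assumed sigma
overall_db = camera_db + display_db                   # losses in dB add in a cascade

print(round(camera_db, 2))  # -3.92
```

A 100% aperture alone costs nearly 4 dB at the band edge. If the same detail sat at a quarter of the sampling rate instead, `aperture_response(0.25)` gives a gain of about 0.90, which is the effect that oversampling relies on.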
Figure 7j)
Practical CRT spot has a Gaussian intensity function
Figure 7k)
Overall response of camera + display. Aperture effects mean that conventional systems fall far short of ideal.
Because practical imaging devices can't be made without aperture effect, we can be confident that a real imager having n-lines cannot possibly have n-lines of resolution. In other words, in a conventional n-line television system, the signal passing between a camera and a display may have an n-line structure, but it cannot possibly contain n-lines of information. Thus a system which has the same number of lines in the imager, transmission, and display is inefficient because the transmission bandwidth exceeds the information contained.
When analog television systems were being designed, the technology of the day didn't allow anything other than the simplest approach where camera, transmission, and display all handled the same signal. With digital techniques, those restrictions no longer apply. This is why analog television broadcasting has to be replaced with efficient digital systems; we simply can't afford the waste of radio spectrum.
A system that has a different number of lines in sensor, production equipment, transmission medium and display is advantageous, but impossible in the analog domain. With digital techniques there is no fundamental difficulty. We can send the same picture quality with less bandwidth, or we can send much better quality with the same bandwidth.
A better quality picture might, for example, look sharper and have more detail. Many people think that this requires a traditional high definition television system, which needs lots of lines throughout, but it's a myth. Traditional high definition television systems are just as inefficient as traditional standard definition systems and represent an evolutionary dead end because the bandwidth will never exist to broadcast them except experimentally.
In the digital domain, there is a much more elegant solution to sharper pictures: oversampling. Cameras and displays need lots of lines to overcome aperture effects and to render the raster invisible, but the transmission medium between doesn't.
Although we can't improve the shape of optical filters and we can't change the physics of the aperture effect, we can, however, change the frequencies at which the effects occur. A 480 active line camera can't give 480 lines of resolution, but a camera having more lines can, and the output can be converted to a 480 line signal with a downconverter. Effectively, the camera is using oversampling.
Figure 7l shows that in an oversampling camera, the spatial sampling rate must be increased by using a larger number of pixels in both dimensions (i.e. use an HD camera). In this example the factor is two, although other factors are possible. The optical anti-aliasing filter then only needs to prevent aliasing at twice the original sampling rate. Although the optical filter has a gradual cut-off slope, we are only interested in the bottom half of the band, where the response is reasonably uniform.
Figure 7l)
Oversampling temporarily raises the sampling rate so that a real filter can prevent aliasing without attenuating the wanted passband. In the digital domain, near-ideal filters can be realized and the sampling rate reduced.
The output of the CCD element is digitized and fed to a signal processor designed to act as a spatial low-pass filter having the response shown in Figure 7l. This cuts off at half the original sampling rate, thus allowing that rate to be halved without aliasing. The television industry calls this a downconverter. In computer terminology it's a resizer. The result is a TV signal with the target pixel count. It will contain no spatial aliasing and will not suffer loss at the band edge.
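A one-dimensional version of such a downconverter can be sketched as a low-pass filter followed by 2:1 decimation. The example below is a minimal sketch and not the design of any real product. The tap count of 31, the Hann window, the edge handling, and the function names are all assumptions made for illustration. A real resizer would apply the same process along both picture axes.

```python
import math

def lowpass_taps(num_taps=31, cutoff=0.25):
    """Hann-windowed sinx/x low-pass filter; cutoff is in cycles per input sample.

    A cutoff of 0.25 is half the output sampling rate after 2:1 decimation.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * (n + 1) / (num_taps + 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]   # normalise to unity gain at zero frequency

def downconvert_2to1(line, taps):
    """Filter one line of oversampled pixels and keep every second result."""
    half = len(taps) // 2
    out = []
    for centre in range(0, len(line), 2):
        acc = 0.0
        for k, c in enumerate(taps):
            i = min(max(centre + k - half, 0), len(line) - 1)  # repeat edge pixels
            acc += c * line[i]
        out.append(acc)
    return out

taps = lowpass_taps()
flat = downconvert_2to1([0.5] * 64, taps)                         # plain gray line
fine = downconvert_2to1([(-1) ** i for i in range(64)], taps)     # detail at the input band edge
```

The plain gray line passes unchanged, while detail at the top of the oversampled band, which the lower output rate could not carry without aliasing, is removed before the samples are discarded.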
The reason oversampling works is that real sensors and displays have aperture effect, whereas a pixel represented by a binary number doesn't. The sensor and display need more sampling sites to handle all of the information in a pixel array. It is quite possible to have a series of pixels at the Nyquist frequency, even though no camera or display could resolve it directly.
Another way of looking at oversampling is that the ideal anti-aliasing filter would be digital, but the Catch-22 is that you can't use a digital filter until after the sampling stage and then it's too late. In oversampling, the sampling rate is temporarily raised to allow a realizable optical filter to be used prior to digitizing. Once in the digital domain, the real anti-aliasing process is in a digital filter where the bandwidth and sampling rate are defined.
Note that we can only avoid spatial aliasing using a progressive scan sensor. An interlaced sensor has half the vertical sampling rate in each field and when the fields de-correlate due to motion, vertical detail suffers spatial aliasing. An interlaced signal isn't separable, which means that the time axis and the vertical image axis aren't orthogonal. A downconverter needs to be motion vector steered so that detail from two fields can still be combined in the presence of motion. Oversampling may be used with interlaced cameras, but it is heavily sub-optimal, as it doesn't eliminate artifacts. Progressively scanned video is separable so that downconversion can treat each frame separately, and this is much easier as well as giving better results.
In addition to extracting the maximum resolution at the sensor, oversampling should also be used at the display to render the raster invisible. Once more, a form of resizer is required, but this now increases the number of lines using interpolation. As before, the process is much easier to implement and works better with progressively scanned input signals because they are separable.
Figure 7m)
Interpolator puts a gap between Baseband and Sideband
Raising the number of sampling sites describing the picture puts a gap in the spectrum as shown in Figure 7m. Instead of an impossible steep-cut filter, a realizable filter with a gentler slope in Figure 7n is perfectly adequate to eliminate the sampling sidebands.
Figure 7n)
Gentle filter can remove Sideband without affecting Baseband
The aperture effect of the display can be turned to advantage to filter out the raster, leaving the passband unaffected. Figure 7o shows the signal displayed conventionally on a CRT with the separate intensity functions of each line making the raster visible. Figure 7p shows the result with oversampling. The intensity functions overlap and the depth of modulation due to the raster falls. The same technique is used to transfer digitally manipulated images back onto movie film without any visible raster.
Figure 7o)
Conventional CRT has deep raster modulation
Figure 7p)
Oversampling CRT gives reduced depth of raster modulation. Oversampling maximizes resolution of CRT while minimizing visibility of raster.
The adoption of progressive scan allows spatial oversampling to be easily implemented in both camera and display, and allows camera aliasing to be suppressed. The number of lines needed in the channel between is then quite moderate.
Progressively scanned sensors and displays having 700 to 1000 lines connected by a 480P channel are all that is required to deliver a truly high definition television service. Oversampling makes the transmission inherently scalable because the upconverter in the display is optional and lower cost receivers could omit it. This would result in a picture that was less sharp and where the raster would be visible. This might be acceptable in a small portable TV.
Large expensive receivers would use upconversion to recover all of the resolution in the transmitted signal and suppress the raster. These could also incorporate motion vector steered frame rate upconversion to reduce background strobing.
Figure 7q)
Interlace is a lossy compression scheme. At 2:1, oversampling is lossless. R.I.P. interlace.
Figure 7q contrasts the traditional and oversampled approaches to transmitting a television picture. If we began with a 4x3 aspect ratio standard definition progressively scanned digital picture having square pixels, it would contain something like 704x528 = 371,712 pixels.
In an interlaced system we discard every other line to create a field. This results in 704x264 = 185,856 pixels, halving the required bandwidth.
Using oversampling, we pass the entire picture through a downconverter which reduces the sampling rate in both axes by the square root of 2, or about 1.41. The output picture will then be 498x373 = 185,754 pixels, also halving the required bandwidth. The downconverted picture simply removes the bandwidth that is not exploited in real cameras because of aperture effect. In other words, the downconversion is effectively lossless.
An interlaced picture is put straight on the screen at the display, along with interlace flicker, visible raster, and poor dynamic resolution with vertical aliasing in motion.
When the oversampled picture is to be displayed, it is simply upconverted to, say, 996x746 pixels and displayed with no visible raster, no spatial aliasing, excellent dynamic resolution and no half frame rate components. Essentially oversampling is a near-lossless or artifact-free way of halving the bit rate from a sensor. Interlace is an inferior way of halving the bit rate, which results in a high level of artifacts and damages dynamic resolution.
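The pixel counts in this comparison can be checked with a few lines of arithmetic. This is only a worked example of the figures quoted above, and the function name `pixel_budget` is hypothetical.

```python
import math

def pixel_budget(width=704, height=528):
    """Compare interlace with sqrt(2) downconversion for one progressive picture."""
    full = width * height
    field = width * (height // 2)            # interlace: discard every other line
    w_dn = round(width / math.sqrt(2))       # oversampling: reduce both axes by sqrt(2)
    h_dn = round(height / math.sqrt(2))
    return {
        "full": full,
        "interlaced_field": field,
        "downconverted_size": (w_dn, h_dn),
        "downconverted": w_dn * h_dn,
        "display_size": (2 * w_dn, 2 * h_dn),  # 2:1 upconversion at the display
    }

budget = pixel_budget()
print(budget["full"])                # 371712
print(budget["interlaced_field"])    # 185856
print(budget["downconverted_size"])  # (498, 373)
print(budget["downconverted"])       # 185754
print(budget["display_size"])        # (996, 746)
```

Both routes transmit very nearly half of the original pixel count. The difference lies in which half is discarded.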
In an MPEG environment, the downconverted progressive scan pictures are easier to compress than the interlaced scan pictures for various reasons. Firstly, in an interlaced picture, the macroblocks cover twice the screen area so motion compensation is more approximate. The motion compensation is further degraded in interlace because the vertical aliasing confuses the motion estimator. Consequently, although the input bit rates in our example are the same, an MPEG compressor could achieve a lower output bit rate on the progressive picture than on the interlaced picture for the same level of compression artifacts.
The figures here are simply examples, but the principle holds at any picture size, allowing commercially available equipment to be used. Using oversampling, 720p cameras and displays connected by a 480p channel give extremely good results.
Figure 7r)
Interpolation is the process of computing the values of output samples that lie between the input samples (i.e., the samples in the original video signal). It is thus a form of sampling rate conversion. One way of changing the sampling rate is to return to the analog domain using a Digital to Analog Converter and then to sample at the new rate. In practice, this is not necessary because the process can be simulated in the digital domain. When returning to the analog domain, a suitable low pass filter must be used which cuts off at a frequency of one half the sampling rate.
The impulse response of an ideal low-pass filter is a sinx/x curve that passes through zero at the site of all other samples except the center one. Thus the reconstructed waveform passes through the top of every sample, as shown in Figure 8a. Between samples, the waveform is the sum of many impulses. In an interpolator, a digital filter can replace the analog filter.
Figure 8a)
A digital filter can be made with a linear-phase low-pass impulse response in this way. As a unit input sample shifts across the filter, it is multiplied by various coefficients which produce the impulse response. Figure 8b shows how this could be implemented. A 'windowed' sinx/x impulse response can be described using a set of coefficients stored in a look-up table (LUT).

Figure 8b)
The interpolation method usually employed involves taking the contribution of each input sample at the corresponding distance from the required output sample. All the contributions are summed to obtain the interpolated value. Figure 8c shows the process needed to interpolate to an arbitrary position between samples. The location of the output sample is established relative to the input samples (this is known as the phase of the interpolation), and the values of the impulse responses of all nearby samples at that location are added. In practice, the coefficients can be found by shifting the impulse response by the interpolation phase and sampling it at the new locations. The impulse response is sampled in various phases and the coefficients are held in a look-up table. A different phase can then be obtained by selecting a different LUT page.
Figure 8c)
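The look-up table approach can be sketched in one dimension as follows. This is an illustrative sketch only: the choice of 8 taps and 32 phases, the raised-cosine window, the edge handling, and all function names are assumptions and are not taken from any particular scaler.

```python
import math

TAPS = 8      # input samples contributing to each output sample (assumed)
PHASES = 32   # number of LUT pages between two adjacent input samples (assumed)

def windowed_sinc(x, half_width):
    """sin(pi x)/(pi x) shaped by a raised-cosine window spanning +/- half_width."""
    if abs(x) >= half_width:
        return 0.0
    s = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
    return s * (0.5 + 0.5 * math.cos(math.pi * x / half_width))

def build_lut(taps=TAPS, phases=PHASES):
    """Sample the impulse response once per phase; each page holds one set of coefficients."""
    lut = []
    for p in range(phases):
        frac = p / phases
        # Tap k sits (k - taps//2 + 1) samples from the input sample left of the output.
        page = [windowed_sinc(frac - (k - taps // 2 + 1), taps / 2) for k in range(taps)]
        total = sum(page)
        lut.append([c / total for c in page])   # each page sums to 1
    return lut

def interpolate(samples, position, lut):
    """Compute the value at a fractional position using the stored phase at or just below it."""
    taps, phases = len(lut[0]), len(lut)
    base = math.floor(position)
    page = lut[int((position - base) * phases)]   # select the LUT page for this phase
    acc = 0.0
    for k, c in enumerate(page):
        i = min(max(base + k - taps // 2 + 1, 0), len(samples) - 1)  # repeat edge samples
        acc += c * samples[i]
    return acc

lut = build_lut()
ramp = [float(n) for n in range(20)]
midpoint = interpolate(ramp, 9.5, lut)    # halfway between samples 9 and 10
on_site = interpolate(ramp, 5.0, lut)     # phase 0 returns the input sample itself
```

Resizing a line then amounts to stepping `position` through the input at the ratio of the input and output sampling rates. When reducing the number of samples, the impulse response would also need to be stretched so that the filter cuts off at half the lower rate.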
Oversampling can also be used in the time domain in order to reduce or eliminate display flicker and background strobing. A different type of Standards Converter is necessary, which increases the input picture rate by interpolation. Such an oversampling converter should use motion vector steering, otherwise moving objects will not be correctly positioned in an interpolated picture and the result will be judder.
A conventional linear frame rate converter either just uses a frame store, or better, filters along the time axis by feeding the same pixel from several successive frames into an FIR filter. A temporal aperture of four frames is common, although for some applications, only two frames are used for economy. With such a short aperture, it is not possible to reach an acceptable compromise between roll-off and ripple, and eliminating beating between the input and output frame rates is very difficult.
Figure 9a)
Linear filters (or no filtering at all in the case of just using a frame buffer) suffer from a major defect when used for frame rate changing. If an object is moving, it will be in different places in successive fields. Interpolating between several fields results in multiple images of the object. The position of the dominant image will not move smoothly, an effect that is perceived as judder. If, however, the camera is panning, the moving object will be in much the same place in successive fields and Figure 9a shows that it will be the background that judders.
Motion vector steering is designed to overcome this judder by taking account of the human visual mechanism. It is a way of modifying the action of a frame rate converter so that it follows moving objects along the optic flow axis to eliminate judder in the same way that the eye does. The basic principle of motion vector steering is simple. In the case of a moving object, it appears in different places in successive source frames. Motion vector steering computes where the object will be in an intermediate target frame and then shifts the object to that position in each of the source frames prior to temporal interpolation.
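The difference between the two approaches can be shown with a toy one-dimensional example. This is a simplified sketch under strong assumptions: a single known motion vector is applied to the whole line, shifts are whole pixels, and the background is black. A real converter needs a vector for every pixel of the target frame. The function names are hypothetical.

```python
def shift(frame, dx):
    """Move picture content dx pixels to the right, filling with background (0)."""
    n = len(frame)
    return [frame[i - dx] if 0 <= i - dx < n else 0.0 for i in range(n)]

def linear_interpolate(prev, nxt, alpha):
    """Conventional converter: blend along the time axis only."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(prev, nxt)]

def steered_interpolate(prev, nxt, vector, alpha):
    """Shift each source frame to the object's intermediate position, then blend.

    vector is the displacement in pixels from prev to nxt (assumed known here).
    """
    fwd = round(alpha * vector)   # prev moves forward along the trajectory
    back = fwd - vector           # nxt moves back by the remainder
    return linear_interpolate(shift(prev, fwd), shift(nxt, back), alpha)

prev = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]   # object at pixels 2-3
nxt  = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]   # object at pixels 6-7, so the vector is +4

print(linear_interpolate(prev, nxt, 0.5))
# [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]
print(steered_interpolate(prev, nxt, 4, 0.5))
# [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

The linear result contains two half-amplitude images of the object, whereas the steered result contains a single full-amplitude image midway along the trajectory.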
An alternative way of looking at motion vector steering is to consider what happens in the spatio-temporal volume. A conventional Standards Converter interpolates only along the time axis, whereas a motion vector steered Standards Converter can swivel its interpolation axis off the time axis onto the optic flow axis. Figure 9b shows the input frames in which three objects are moving in different ways. It will be seen that the interpolation axis is aligned with the trajectory of each moving object in turn.
Figure 9b)
When this is done, each object is no longer moving with respect to its own interpolation axis, and so on that axis it no longer generates temporal frequencies due to motion and temporal aliasing cannot occur. Interpolation along the correct axes will then result in a sequence of output frames in which motion is properly portrayed. The process requires a Standards Converter that contains filters that are modified to allow the interpolation axis to move dynamically with each output field.
The signals that steer the interpolation axis are known as motion vectors and one of these must be available from the motion estimator for every pixel in the target frame. These are not just block based motion vectors. It is the job of the motion estimation system to provide these pixel accurate motion vectors. The overall performance of the converter is determined primarily by the accuracy of the motion vectors. An incorrect vector will result in unrelated pixels from several fields being superimposed and the result is unsatisfactory.
Motion vector steering should also be used when converting an interlaced scan video signal into progressive scan format. As the interlace process reduces the information in each field, the job of the motion estimator is somewhat harder. It is only economically feasible to use motion vector steered de-interlacers within TV studios for converting archive material for transmission in the new progressive transmission format.