Optimizing Low Bit Rate Audio

Bill Birney
Microsoft Corporation
May 2003


Introduction

Many Internet broadcasters and content providers use the Windows Media Audio 9 codec to deliver audio content at 32 kilobits per second (Kbps) with quality that is similar to FM radio. Others use bit rates of 64 Kbps to 128 Kbps to deliver quality similar to audio CDs.

However, many Internet broadcasters and other producers can take advantage of very low bit rate audio to deliver AM radio quality music and voice-only streams over slow and unreliable networks. The standard codec streams at bit rates as low as 5 Kbps with a sampling rate of 8 kilohertz (kHz), and is designed to compress a variety of sounds. If your content is voice-only, however, you may be able to achieve higher quality at lower bit rates by using the Windows Media Audio 9 Voice codec, which is designed to optimize quality when the content is voice. The voice codec can compress to bit rates as low as 4 Kbps at 8 kHz. If the audio contains a mix of music and voice, you can configure the voice codec to switch automatically between the voice and music compression algorithms.

Abstract
Create content that streams at very low bit rates with quality similar to AM radio, and encode voice-only content that streams at a bit rate as low as 4 kilobits per second (Kbps). This article describes how to encode with the Windows Media Audio 9 and Windows Media Audio 9 Voice codecs, and how to think like a codec to optimize the quality of your low bit rate audio.
At very low bit rates, the audio quality is similar to that of a telephone. The secret to getting the best sound at very low bit rates is to understand how the codec works. You can achieve the best voice quality by recording voice content in a way that helps the codec do the best job of compression.

This article provides a high-level view of how the Windows Media Audio 9 codecs work with low bit rate content so that you can create the best sounding audio. The article contains the following topics:
  • Codec Basics
  • Voice Codec
  • Scenario
  • Additional Tips


Codec Basics

You don't need to know every detail of how a codec works to create content, but it helps to know the basics.

All codecs, whether they are designed to work with audio data or data contained in a spreadsheet, compress content by removing data. Then, when the content is decoded on the client, the codec adds data back in. A lossless codec recreates the original exactly; a lossy codec creates an approximation of the original. A lossless codec may be used to compress text documents, for example, where an approximation is not good enough. However, to achieve the amount of compression needed to stream digital media over networks and to minimize storage requirements, Windows Media codecs, like most other digital media codecs, use lossy compression. This means that information is lost during the compression process. The lower the bit rate of the digital media, the more data the codec must discard.

The job of a digital media codec is to control which data is removed, so that the compressed audio has the best sound possible. The compression algorithms used in Windows Media Audio codecs are designed around principles of psycho-acoustics. These principles help identify which parts of the sound can be modified to reduce data while creating a listening experience that is perceptually as close as possible to the original. By applying psycho-acoustic principles and the results of many listening tests, Windows Media codec designers create algorithms that find efficiencies in the sound data.

You can, in a sense, help the codec decide what to remove by simplifying the sound before the codec processes it. This is why voice can be compressed more than music: the sound is less complex. A simpler sound is:
  • More predictable. Music contains many highs and lows in dynamics and an infinite number of variations in tonal complexity. Voice, on the other hand, is generally limited to a smaller set of potential sounds.


  • Narrower frequency range. The portion of vocal sound that provides intelligibility can be reduced to a narrow range of frequencies. Music and most other sounds cover a very broad range of frequencies: from the very low rumble of an avalanche or freeway traffic to the very high frequencies found in bird sounds and cymbals. Some sound sources contain white noise, which consists of all frequencies.
The Windows Media Audio 9 standard codec can reproduce most complex sounds well at bit rates above 20 Kbps. However, when using very low bit rates with the voice codec, you can improve the quality of the compression by reducing the complexity of the source sound.
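
To make the idea of a narrower frequency range concrete, the following is a minimal sketch in Python of rolling off low and high frequencies with two simple first-order filters. It is not how the codec or any Windows Media tool works internally. The function names and the cutoff values (300 Hz and 3,400 Hz, roughly the traditional telephone band) are assumptions chosen for illustration only.

```python
import math


def one_pole_high_pass(samples, cutoff_hz, sample_rate_hz):
    """Attenuate content below cutoff_hz (for example, rumble) with a first-order filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        prev_y = alpha * (prev_y + x - prev_x)
        prev_x = x
        out.append(prev_y)
    return out


def one_pole_low_pass(samples, cutoff_hz, sample_rate_hz):
    """Attenuate content above cutoff_hz (for example, hiss) with a first-order filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = dt / (rc + dt)
    out, prev_y = [], 0.0
    for x in samples:
        prev_y = prev_y + alpha * (x - prev_y)
        out.append(prev_y)
    return out


def band_limit(samples, sample_rate_hz, low_cut_hz=300.0, high_cut_hz=3400.0):
    """Keep mostly the band between the two cutoffs (assumed values, not codec settings)."""
    without_lows = one_pole_high_pass(samples, low_cut_hz, sample_rate_hz)
    return one_pole_low_pass(without_lows, high_cut_hz, sample_rate_hz)


def sine(frequency_hz, sample_rate_hz, seconds=0.5):
    """Generate a unit-amplitude test tone."""
    count = int(sample_rate_hz * seconds)
    return [math.sin(2.0 * math.pi * frequency_hz * i / sample_rate_hz) for i in range(count)]


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


if __name__ == "__main__":
    rate = 22050
    for label, freq in (("rumble", 60), ("voice band", 1000), ("hiss", 10000)):
        level = rms(band_limit(sine(freq, rate), rate))
        print(f"{label:10s} {freq:6d} Hz -> RMS after filtering: {level:.2f}")
```

In this sketch the 1,000 Hz tone passes through with much less attenuation than the 60 Hz and 10,000 Hz tones. First-order filters roll off gently; a real equalizer or a steeper filter would isolate the voice band more sharply.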



Voice Codec

To compress low bit rate content, you can use the standard Windows Media Audio 9 codec. However, the Windows Media Audio 9 Voice codec can provide the best quality speech compression at very low bit rates because it is designed for the particular complexity of the human voice. At very low bit rates, the goal of the voice codec is not to create a perfect reproduction of vocal sounds, but to provide good quality with the most intelligibility and the fewest artifacts. The voice codec generates artifacts when a sound is too complex, such as when it contains non-voice content. The more complex a sound is, the more likely it is that the codec will remove critical voice content to reach the very low bit rate, and the more artificial the output will sound.

In voice mode, the codec algorithm is optimized for voice. If the sound data contains voice and no background sounds, such as music, crowd noise, and air conditioning, the codec compresses with low artifacts and high intelligibility. If the sound contains a mix of voice and non-voice material, you can configure the encoder to switch automatically between voice and music modes. In music mode, the codec compresses the data the same way as the Windows Media Audio 9 standard codec. By using automatic selection, you can encode a low bit rate stream for a very slow or congested network, for example, and the codec automatically chooses the best compression method.

The codec switches between modes by detecting complexity in the sound data. However, keep in mind that just because sound data is complex does not mean it is music. Often, complexity is added to the data by background noise, which triggers the codec to switch to music mode. Therefore, to optimize sound for the voice codec, reduce background noise; doing so increases the quality of compression.



Scenario

We will describe how to optimize sound for the voice codec by presenting a fictitious scenario. In this scenario, we will use the voice codec to encode a reading of a short story for an e-book. Users with Internet connections as low as 13 Kbps will then be able to stream the file from a Web site. We will describe the components of the recording system and how they are configured. The best recordings can be made in an acoustically controlled environment, such as a studio. However, to make the scenario more interesting, we will record the speaker in a noisy auditorium.

Our setup includes the following components:
  • Lavalier microphone. This type of microphone is made small so that it can be clipped to a tie or attached to a collar. Figure 1 shows a typical lavalier microphone.

    Figure 1. Lavalier microphone

    The advantage of this type of microphone is that it can be positioned close to the source of the sound, the author's mouth, and because the lavalier is attached to the speaker, the position does not change much. Close proximity to the sound source makes the sound clearer and enhances the isolation of the source from external sounds. By isolating the voice, complexity is reduced, and the sound is more predictable for the codec. Also, lavaliers tend to have a smaller pickup pattern, which provides further isolation, and many have a narrower frequency range that is more suited to voice. The microphone on the podium could be used, but the distance between the speaker and microphone will change as the speaker moves around. Also, the podium microphone is more likely to pick up extraneous sound.


  • Audio mixer. The microphone signal is run through a small mixer that has a three-band equalizer. The equalizer is an advanced tone control, which we will use to reduce the parts of the signal that are not part of the voice sound. Specifically, we will roll off the low frequencies that contain sounds such as air conditioning and traffic rumble, and roll off the high frequencies that contain high-pitched hissing sounds. The high-frequency roll-off must not remove intelligibility from the voice, in particular the sibilant "S" and "T" sounds. We may also increase the mid-range frequencies to bring out the voice. Again, we are not attempting to faithfully reproduce the sound of the speaker, but to make the sound less complex and emphasize the intelligibility and clarity of the voice.


  • Computer and Windows Media Encoder 9 Series. We can use a computer with a relatively slow CPU, because encoding and compressing low bit rate audio requires a small amount of system resources compared to, for example, encoding high-bandwidth video. Available hard disk space can also be modest because the voice files will be comparatively small. A one-hour file encoded at 10 Kbps requires only around 4.5 megabytes (MB) of storage. To make our recording truly portable, we will capture on a laptop with a CPU speed of 400 megahertz (MHz) and a 10 gigabyte (GB) hard disk.
The microphone is connected to a microphone input on the mixer, and the mixer is connected to the Line In input on the laptop.
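
The storage estimate follows directly from bit rate and duration. The following short Python sketch shows the arithmetic, assuming 1 Kbps means 1,000 bits per second and 1 MB means 1,000,000 bytes, and ignoring any file container overhead. The function name is ours and is not part of any Windows Media tool.

```python
def stream_size_megabytes(bit_rate_kbps, duration_seconds):
    """Approximate audio payload size in MB (1 MB = 1,000,000 bytes, no container overhead)."""
    total_bits = bit_rate_kbps * 1000 * duration_seconds
    return total_bits / 8 / 1_000_000


ONE_HOUR_SECONDS = 60 * 60

for rate_kbps in (10, 13):
    size_mb = stream_size_megabytes(rate_kbps, ONE_HOUR_SECONDS)
    print(f"One hour at {rate_kbps} Kbps: about {size_mb:.2f} MB")
```

One hour at 10 Kbps, the audio format selected in the encoder settings for this scenario, comes to about 4.5 MB. A stream that used the full 13 Kbps of the slowest connection would need about 5.85 MB.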

Setting Up the Encoder

We will use the following procedure to configure the encoder:
  1. Adjust the volume and EQ on the mixer. Use the microphone channel volume control and VU meters to optimize the audio level. The level should be as high as possible without clipping (hitting the maximum point on the meter). In addition to producing a loud scratching sound, clipping adds complexity, which increases artifacts. If the levels vary widely, make sure to leave plenty of range between the highest level and the clipping point. You can set the EQ controls similarly to those shown in the following illustration. The exact settings can only be determined by listening to the results.

    Figure 2. Three-band equalizer settings for voice recording


  2. Set up the encoder session. In Windows Media Encoder, enter the following settings:
    • Source. Audio capture card.
    • Output. The path and name of the file you want to encode.
    • Compression. Select the Windows Media Audio 9 Voice codec, and use the 10 Kbps 11 kHz audio format. Keep in mind that with sampling rates lower than 22 kHz, sibilance can be severely reduced, which reduces intelligibility.
    • Processing. If the recording is voice-only, and background noise is low, select Voice only optimization. If there is some music, select Audio with voice emphasis.

      When you source from a file and find the codec does not switch between modes the way you want, you can create an optimization definition file. This file is a list that contains explicit timing instructions for switching modes. For more information, see Windows Media Encoder Help.

  3. Synchronize VU meters. On the encoder VU meter, click the Mixer button, and then adjust the Line In control, so that the levels on the encoder meter match the levels on the audio mixer.
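
If you capture a test recording before the real session, you can also check levels numerically rather than relying only on the meters. The following Python sketch is a rough illustration. It assumes the recording is already available as a list of signed 16-bit PCM sample values, and the function names are hypothetical. It reports the peak level relative to full scale and counts samples that sit at the clipping point.

```python
import math

# Assumption: samples are signed 16-bit PCM integers in the range -32768..32767.
FULL_SCALE = 32768
CLIP_THRESHOLD = 32767


def peak_dbfs(samples):
    """Return the peak level in decibels relative to full scale (0 dBFS is the maximum)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # digital silence
    return 20.0 * math.log10(peak / FULL_SCALE)


def count_clipped(samples):
    """Count samples that reach the largest value the format can hold."""
    return sum(1 for s in samples if abs(s) >= CLIP_THRESHOLD)


if __name__ == "__main__":
    test_take = [0, 12000, -16384, 9000, -4000]
    print(f"Peak level: {peak_dbfs(test_take):.1f} dBFS")
    print(f"Clipped samples: {count_clipped(test_take)}")
```

A peak well below 0 dBFS with no clipped samples suggests there is headroom left. How much headroom to leave depends on how widely the speaker's level varies, so the final judgment still comes from listening.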



Additional Tips

If the recording contains some music, such as background or intro music, use automatic selection to achieve the highest quality. On the other hand, if the stream will contain mostly music, such as an Internet music radio station, you should use the standard codec. You can achieve roughly AM radio quality with the 20 Kbps, 22 kHz format setting. As with voice, you can help the codec by reducing complexity. For example, if you must encode music with an 11 kHz sampling rate, the high frequencies will merely add complexity to the data and increase artifacts. In that case, you can sometimes help the codec do its job by rolling off high frequencies.
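
The reason high frequencies do not help at low sampling rates is that digital audio can only represent frequencies below half of its sampling rate (the Nyquist limit). The following Python sketch shows the arithmetic for the nominal sampling rates mentioned in this article. The function names are ours, and the rates are the rounded values used in the text rather than exact encoder settings.

```python
def highest_frequency_hz(sampling_rate_hz):
    """Upper frequency limit for a given sampling rate: half the rate (the Nyquist limit)."""
    return sampling_rate_hz / 2


def can_represent(frequency_hz, sampling_rate_hz):
    """True if a tone at frequency_hz falls below the limit for this sampling rate."""
    return frequency_hz < highest_frequency_hz(sampling_rate_hz)


for rate_hz in (8_000, 11_000, 22_000):
    limit = highest_frequency_hz(rate_hz)
    print(f"{rate_hz} Hz sampling: content up to about {limit:.0f} Hz")
```

With an 11 kHz sampling rate, for example, nothing above roughly 5.5 kHz can be reproduced, so rolling off the source above that point removes material the format could not carry anyway.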

If possible, stream music at 32 Kbps or higher. To cover a broader range of bandwidths, encode a multiple bit rate (MBR) stream. For example, you could encode an MBR stream with the standard audio codec at 20 Kbps, 32 Kbps, and 64 Kbps. When a user connects, the Windows Media server and Windows Media Player exchange bandwidth information and the server selects the stream that is most appropriate.
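
The exact logic that the server uses to choose a stream is not described in this article. As a rough mental model only, the following Python sketch assumes a simple policy: pick the highest bit rate that fits the client's available bandwidth, and fall back to the lowest stream if none fits. The function name and the policy are assumptions for illustration.

```python
def select_stream(available_kbps, stream_rates_kbps):
    """Assumed policy: highest stream rate that fits, otherwise the lowest stream offered."""
    fitting = [rate for rate in stream_rates_kbps if rate <= available_kbps]
    if fitting:
        return max(fitting)
    return min(stream_rates_kbps)


MBR_RATES_KBPS = [20, 32, 64]  # the example MBR stream from the text

for bandwidth_kbps in (13, 28, 56, 300):
    chosen = select_stream(bandwidth_kbps, MBR_RATES_KBPS)
    print(f"Client with {bandwidth_kbps} Kbps available -> {chosen} Kbps stream")
```

Under this model, a client with less bandwidth than the lowest stream still receives the 20 Kbps stream and is likely to experience interruptions, which is one reason to include a bit rate low enough for your slowest expected connection.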

Another approach to encoding a speaker is to capture a file first at a high bit rate, and then encode a low bit rate file from that. For example, you can use the Windows Media Audio 9 Professional codec or Windows Media Audio 9 Lossless codec to encode an archive file at near CD quality by using a bit rate such as 256 Kbps, or a variable bit rate (VBR) mode with a quality level of 75 percent. Then, you could encode a low bit rate file from the high bit rate archive. The advantage of this technique is that you will have a high-quality capture that can be recorded to a CD or offered as a download.






© 2008 Microsoft Corporation. All rights reserved.