Let’s begin this chapter by reviewing how we separately source-code voice, text, image, and video content. The following sections treat each of these topics in detail.
Voice
We have already discussed the adaptive multirate vocoder and wideband vocoder specified by 3GPP1. This is a speech synthesis codec and, as a result, conveniently provides us with the ability to support speech recognition, also specified by 3GPP1. The better the accuracy of the speech recognition (the distance from user to user), the higher the value. Similarly, the better the voice quality (measured on a mean opinion score), the more user value we deliver, but the more it costs to deliver, because of a higher coding rate.
Source Coding
7
Figure 7.1 Audio codec—time domain to frequency domain transform.
These audio codecs use a time domain to frequency domain transform (discrete cosine transform) to expose redundancy in the input signal (see Figure 7.1). We send filter coefficients that describe the spectral/harmonic (frequency domain) content of the 20-ms speech sample.
MPEG-4 also has an audio coding standard including a very low bit rate harmonic codec (2 to 4 kbps) and a codebook codec (4 to 24 kbps). The codebook codec stores waveform samples in the decoder. When the digital filter coefficients are received, the decoder fetches the closest-match waveform from its stored codebook, hence the need for good memory fetch management in these devices. The intention is that the MPEG-4 CELP (code-excited linear prediction) codec will be compatible with the AMR-W codec, which has a similar codec rate range.
Text
Having captured our wideband (16 kHz) audio, we now want to add some text. Text source coding has traditionally been realized using ASCII (American Standard Code for Information Interchange). ASCII uses 7-bit words to form an alphabet of 128 codes that describe letters, numbers, full stops, and other text necessities.
ASCII works okay for Latin script (English, etc.) but runs out of address bandwidth if a more complex language has to be described (for example, Japanese, with thousands of characters). Japanese, Chinese, Arabic, or Hebrew SMS can be realized using UCS2 (Universal Multiple Octet Coded Character Set), a 16-bit/2-octet character string, or UCS4, a 32-bit/4-octet character string.
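The octet cost of each coding can be illustrated with a short sketch. It rests on an assumption: Python has no codec literally named UCS2 or UCS4, so the big-endian UTF-16 and UTF-32 codecs stand in for them. For characters in the Basic Multilingual Plane these give the same 2-octet and 4-octet codes. The sample strings are illustrative only.

```python
# Byte cost of text under different character codings.
# Assumption: "utf-16-be" and "utf-32-be" stand in for UCS2 and UCS4, which
# is valid for Basic Multilingual Plane characters such as those used here.

latin = "Hello"
japanese = "日本語"  # three characters, none representable in ASCII


def encoded_size(text, codec):
    """Return the number of octets needed to encode text with codec."""
    return len(text.encode(codec))


ascii_size = encoded_size(latin, "ascii")        # stored as one octet per character
ucs2_size = encoded_size(japanese, "utf-16-be")  # two octets per character
ucs4_size = encoded_size(japanese, "utf-32-be")  # four octets per character

# ASCII simply has no code points for Japanese characters.
try:
    japanese.encode("ascii")
    ascii_can_hold_japanese = True
except UnicodeEncodeError:
    ascii_can_hold_japanese = False

print(ascii_size, ucs2_size, ucs4_size, ascii_can_hold_japanese)
```

The wider codings buy coverage at the price of two or four times the delivery bandwidth per character.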
ASCII, UCS2, and UCS4 all allow perfectly acceptable representation of text on a grayscale LCD. However, we have said that we are beginning to see an increasing use of high-definition high color depth displays. These displays provide us with the capability to do text rendering by using pixel manipulation.
Pixel elements are made up of pels (picture elements) representing the singular red, green, or blue value of an RGB pixel. Remember that the number of bits used per pixel determines the amount of control you have over the color balance—24 bits gives you high color depth. The size of the image is the product of the number of pixels times the number of bits per pixel.
[Figure 7.1 labels: time domain data (real or complex), FFT, frequency domain (complex).]
Text rendering is effectively subpixel manipulation, borrowing subpixels from adjacent whole pixels. The borrowed subpixels are always adjacent to their complementary color pixels, which our eyes mix to form white. We can therefore use subpixel manipulation to clean up jagged edges. Subpixel manipulation also only works on the horizontal resolution of LCDs. Even so, this means we can do the following:
Emboldening (stretching text horizontally)
Kerning (shifting text horizontally, that is, micro-justification)
Italicizing (slanting type by skewing it horizontally)
Subpixel manipulation only works for LCDs, not CRTs. CRTs are not addressable at subpixel level, but then, as yet, no digital cellular handsets have CRT displays.
This means we can produce book-quality text on our screens, if we so desire. We must be aware, however, that not all LCDs have the same ordering of RGB subpixels. The rendering engine needs to know whether subpixels are arranged in forward or reverse order. Also, text rendering only works for landscape, not portrait, aspect displays, which means it is not really suitable for e-books, which would be an obvious application. Text rendering is now, however, included in a number of software products (Windows 2000 being one example) and will likely begin to appear further down the portable product food chain at a later date.
Image
Now that we have added beautifully rendered text to our wideband audio, it is time to add image bandwidth. An A4 image scanned at 300 dpi resolution and 24-bit color, however, produces a 24-Mbyte file—potentially a memory and delivery bandwidth embarrassment. As a result, we have a choice of lossless or lossy compression.
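As a rough check on the file size quoted here, the sketch below multiplies pixels by bits per pixel. It assumes A4 is 210 mm x 297 mm and rounds each dimension to a whole number of pixels; the function name is illustrative. The result is about 26 million bytes, or a little under 25 MiB, which is in line with the figure above.

```python
# Uncompressed size of a scanned page: number of pixels x bits per pixel.
# Assumption: A4 is taken as 210 mm x 297 mm, each side rounded to whole pixels.

MM_PER_INCH = 25.4


def scan_size_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Return the uncompressed image size in bytes."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel // 8


a4_bytes = scan_size_bytes(210, 297, dpi=300, bits_per_pixel=24)
print(f"{a4_bytes} bytes = {a4_bytes / 2**20:.1f} MiB")
```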
In lossless compression, all the data in the original image can be completely reconstructed in the receiver. Lossless compression is typically used in medical imaging, image archiving, or for images where any loss of information compromises application integrity. The problem with lossless compression is that it is hard to achieve compression rates of more than 2:1 or 3:1.
An example of a lossless compression technique used for storage system optimization is a dictionary-based scheme developed by Loughborough University and Actel, a memory product vendor. This compression technique has a learning capability and builds up a dictionary of previously sent data, which it shares with the receiver. If an exact match can be made, only the dictionary reference needs to be sent. If an exact match is not possible, the information is sent literally, that is, with no compression.
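The general idea of a shared, learned dictionary can be sketched as follows. This is a generic illustration only and not the Loughborough/Actel algorithm, whose details are not given here. The token names "lit" and "ref" and the fixed-block input are assumptions made for the example.

```python
# Generic dictionary coder: send a reference on an exact match, else a literal.
# Assumption: sender and receiver build identical dictionaries because both
# add every literal in the order it is sent.

def encode(blocks):
    """Turn a list of data blocks into ("lit", data) or ("ref", index) tokens."""
    dictionary = {}
    tokens = []
    for block in blocks:
        if block in dictionary:
            tokens.append(("ref", dictionary[block]))   # exact match found
        else:
            dictionary[block] = len(dictionary)         # learn the new block
            tokens.append(("lit", block))               # sent uncompressed
    return tokens


def decode(tokens):
    """Rebuild the original blocks, learning the same dictionary on the way."""
    dictionary = []
    blocks = []
    for kind, value in tokens:
        if kind == "lit":
            dictionary.append(value)
            blocks.append(value)
        else:
            blocks.append(dictionary[value])
    return blocks


data = [b"sky", b"sky", b"cloud", b"sky"]
tokens = encode(data)
print(tokens)
print(decode(tokens) == data)
```

Because every block is recovered exactly, the scheme is lossless. The gain depends entirely on how often earlier data repeats.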
In lossy compression, we take the decision that a certain amount of information can be thrown away. The impact of discarding the information is either not noticeable or is acceptable both to the person or device sending or storing the image and to the person or device receiving or storing the image. Compression ratios of 40:1 or higher are relatively easy to achieve with lossy compression. Compression schemes tend to be optimized either to improve storage bandwidth efficiency or delivery bandwidth efficiency, but not necessarily both.
Image compression standards are codified by the Joint Photographic Experts Group, or JPEG. The Joint Bi-level Image Experts Group (JBIG) looks after document compression, document scanning, and optical character recognition (OCR). Bi-level means black and white, but the group also addresses grayscale compression. JPEG 2000 is the unified standard covering lossy and lossless compression and introduces the concept of Q factor.
A JPEG image is built up of a number of 8 x 8 pixel blocks that are transformed (like our audio codec) into the frequency domain, in this case from the spatial rather than the time domain. The frequency content of the image is described by a string of digital coefficients. If one pixel block exactly matches the next, effectively, a “same again” message is sent. For example, endless blue sky would produce a whole series of identical pixel blocks. If a cloud appears, this changes the frequency content, and new digital coefficients need to be generated and sent. Or perhaps not. We can choose to ignore the cloud, pretend it isn’t there, and send a “same again” message, but some important information will have been left out.
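To make the block transform concrete, here is a naive 8 x 8 two-dimensional DCT in plain Python. It is a sketch for illustration: it uses orthonormal scaling and omits the level shift, quantization, and entropy coding of a real JPEG encoder. The "sky" and "cloud" blocks are invented test data. A flat block yields a single significant coefficient, whereas a block containing an edge needs several.

```python
import math

N = 8  # JPEG block size


def dct_2d(block):
    """Naive 2-D DCT-II of an N x N block with orthonormal scaling."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


def significant(coeffs, threshold=1e-6):
    """Count coefficients whose magnitude exceeds the threshold."""
    return sum(1 for row in coeffs for c in row if abs(c) > threshold)


sky = [[100] * N for _ in range(N)]                 # uniform block
cloud = [[100] * 4 + [200] * 4 for _ in range(N)]   # block with a vertical edge

sky_coeffs = dct_2d(sky)
cloud_coeffs = dct_2d(cloud)
print(significant(sky_coeffs), significant(cloud_coeffs))
```

The four nested loops are deliberately simple. Practical encoders use fast separable transforms instead.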
A Q factor of 100 means any difference between pixel blocks is coded and sent. A Q of 90 means small block-to-block differences are ignored with some (hardly noticeable) loss of quality. A Q of 70 means larger block-to-block differences are ignored, with a greater loss of quality that is still not very noticeable. In digital cameras, a Q of 90 equates to fine camera mode, and a Q of 70 equates to standard camera mode. We choose 70 when we want to fit more pictures into the memory stick or multimedia card. The choice of Q, however, also determines delivery bandwidth requirements.
As mentioned, the noticeability of quality degradation is also a product of the quality of display being used: A poor-quality display does not deserve a high Q picture; a good quality display is wasted if a poor Q is used.
Say we have a picture taken in fine camera mode (Q = 90), which creates a file size of 172,820 bytes. This will take 41.15 seconds to send over an uncoded 33.6 kbps channel (this is assuming the user data rate is the same as the channel rate with no forward error correction added in). If we took the same picture and had a Q of 5, the file size would reduce to 12,095 bytes and we could send it at the same channel rate in 2.88 seconds. The cost of delivery would be about 14 times less for the Q-5 file. The question is, how much would the quality be impaired and how much value would be lost?
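The delivery-time arithmetic in this example can be reproduced with a few lines. The sketch takes the same assumption as the text, namely that the user data rate equals the channel rate, and the function name is illustrative. It gives roughly 41.15 seconds for the Q-90 file, roughly 2.88 seconds for the Q-5 file, and a size ratio of a little over 14.

```python
# Time to deliver a file over a channel with no coding overhead.
# Assumption: user data rate equals channel rate (no forward error correction).

def transfer_seconds(file_bytes, channel_bps):
    """Return the seconds needed to send file_bytes at channel_bps."""
    return file_bytes * 8 / channel_bps


CHANNEL_BPS = 33_600
fine_bytes = 172_820   # Q = 90
low_bytes = 12_095     # Q = 5

fine_time = transfer_seconds(fine_bytes, CHANNEL_BPS)
low_time = transfer_seconds(low_bytes, CHANNEL_BPS)
ratio = fine_bytes / low_bytes

print(f"Q=90: {fine_time:.2f} s, Q=5: {low_time:.2f} s, ratio {ratio:.1f}:1")
```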
This highlights an important issue. Voice-quality metrics are well established. We use a mean opinion score to provide an objective way of comparing subjective quality assessments. For instance, we put 10 people or 100 people in a room and ask them to score a voice for quality, and then produce a mean opinion score (MOS) to describe the perceived quality. JPEG Q gives us an objective measurement of image quality, but we do not presently have a way of setting this against a subjective scorecard. As we will see later, the same problem occurs with video quality.
This is important when we come to negotiate network quality with a customer. In a 2G cellular network, we agree with a network operator to a certain bit error rate (typically 1 in 10^3). This is deemed acceptable and defines the coverage area in which the radio signal will be sufficient to deliver the defined BER or better. We can then show how this BER relates to voice quality and define the MOS achievable across a percentage of the coverage area.
Video
No such established relationship presently exists for image or video quality. We also need to consider that compression ratios increase as processor bandwidth increases.
As a rule of thumb, you can expect video compression ratios to increase by an order of magnitude every 5 years. In 1992, a data rate of 20 Mbps was required for broadcast-quality video. By 1997, this had reduced to 2 Mbps. However, as compression ratios increase, the quality of the source-coded material decreases. Digital TV provides an example, as shown in Table 7.1. A compression ratio of 100:1 yields VHS quality; a compression ratio of 10:1 yields high-definition TV.
Inconveniently, higher compression ratios also mean the data stream becomes more sensitive to errors and error distribution (burst errors) on the channel. These can be coded out by block coding, convolutional coding, and interleaving, but this introduces delay, and, of course, time is money.
If we take the historical trend forward, by 2007 we could have compression ratios of 500 to 1. These will work very well over low BER consistent physical channels, for example, an ADSL line specified at 1 in 10^10 BER or optical fiber specified at 1 in 10^12 BER (1 in 1,000,000,000,000 bits errored, effectively an errorless channel). These highly compressed media files will work less well over inconsistent, relatively high BER radio channels.
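To put these error rates in perspective, the sketch below estimates how many bit errors a file would see at each BER. It assumes independent, evenly distributed bit errors, which is a simplification because radio channels tend to produce bursts. The 172,820-byte file size is borrowed from the earlier camera example, and the function names are illustrative.

```python
import math

# Expected errors in a file at a given bit error rate.
# Assumption: bit errors are independent (no bursts), which flatters radio links.

def expected_bit_errors(file_bytes, ber):
    """Return the mean number of errored bits in the file."""
    return file_bytes * 8 * ber


def prob_any_error(file_bytes, ber):
    """Return the probability that at least one bit is errored."""
    n_bits = file_bytes * 8
    # 1 - (1 - ber)^n, written with log1p/expm1 to stay accurate for tiny ber.
    return -math.expm1(n_bits * math.log1p(-ber))


FILE_BYTES = 172_820
for label, ber in [("radio, 1 in 10^3", 1e-3),
                   ("ADSL, 1 in 10^10", 1e-10),
                   ("fiber, 1 in 10^12", 1e-12)]:
    print(f"{label}: {expected_bit_errors(FILE_BYTES, ber):.3g} expected errors, "
          f"P(any error) = {prob_any_error(FILE_BYTES, ber):.3g}")
```

On these assumptions the radio channel corrupts over a thousand bits of the file, while the wired channels almost never corrupt any.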
This brings us to the issue of differential encoding. In JPEG, we compare one pixel block with another and produce a difference figure. In MPEG, we do the same, but in addition, we look for similarities from image to image and express these as a difference coefficient. The problem with differential encoding is that it does not like delivery bandwidth discontinuity, for instance, burst errors on the radio channel or non-isochronous packets in the network.
The problem is partially overcome by using periodic refresh pictures. This is known as intracoding. The refresh pictures are only spatially, not temporally, compressed. Even using intracoding, differentially encoded video streams can be very jerky when sent over a wireless network (particularly, as we discuss later, over a wireless IP network). An alternative is to use JPEG for video. Individual still images become moving images by simple virtue of being sent at a suitable frame rate per second. JPEG does not use differencing and therefore avoids the problem, but it does not provide the same level of compression efficiency.
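The effect of refresh pictures on error propagation can be shown with a toy model. This is a sketch under strong simplifications: each "frame" is a single integer rather than an image, and the token names "I" (intracoded) and "D" (difference) and the `intra_period` parameter are assumptions made for the example. A corrupted difference damages every following frame until the next intracoded frame arrives.

```python
# Toy differential coder with periodic intracoded refresh frames.
# Assumption: a frame is one integer; real codecs difference whole images.

def encode(frames, intra_period):
    """Send every intra_period-th frame whole, the rest as differences."""
    tokens = []
    for i, frame in enumerate(frames):
        if i % intra_period == 0:
            tokens.append(("I", frame))                  # refresh picture
        else:
            tokens.append(("D", frame - frames[i - 1]))  # difference only
    return tokens


def decode(tokens):
    """Rebuild frames; a difference is applied to the previous decoded frame."""
    frames = []
    for kind, value in tokens:
        frames.append(value if kind == "I" else frames[-1] + value)
    return frames


def damaged_frames(frames, intra_period, hit_index, error=7):
    """Corrupt one token in transit and list which decoded frames are wrong."""
    tokens = encode(frames, intra_period)
    kind, value = tokens[hit_index]
    tokens[hit_index] = (kind, value + error)
    decoded = decode(tokens)
    return [i for i, (a, b) in enumerate(zip(frames, decoded)) if a != b]


video = list(range(10, 22))  # 12 frames
print(damaged_frames(video, intra_period=4, hit_index=5))   # refresh every 4 frames
print(damaged_frames(video, intra_period=12, hit_index=5))  # single initial refresh
```

More frequent refresh limits the damage but spends more bits on whole pictures, which is the trade-off the text describes.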
The better answer is to improve radio and network bandwidth quality. Better radio bandwidth quality means avoiding burst errors in the radio channel; better network bandwidth quality means avoiding transmission re-tries and minimizing delay and delay variability. This then allows the efficiency benefits of differential encoding to be realized.
Table 7.1 Compression versus Quality in Digital TV
COMPRESSION RATIO    CHANNEL RATE    RESOLUTION
10:1                 20              High definition
20:1                 10              Enhanced definition
40:1                 5               PAL