Voice activity detection, or VAD, is a technique employed by digital signal processors (DSPs) that reduces the volume of voice traffic transmitted by automatically detecting silent periods in conversations and suspending traffic generation during those periods. Approximately 50 to 60 percent of most conversations is silence, because while one party is speaking the other party is usually listening silently. With VAD enabled, the bandwidth that would otherwise have been consumed by silent voice data can be saved and allocated to other traffic types, such as data.
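The saving described above can be estimated with simple arithmetic. The sketch below is illustrative only: the function name and the 64-kbps codec rate are assumptions chosen for the example, and it ignores packet-header overhead and any traffic still sent during silence.

```python
def average_voice_bandwidth_kbps(codec_rate_kbps: float, silence_fraction: float) -> float:
    """Average one-way bandwidth when silent periods are not transmitted.

    Simplified model: traffic is sent only during the non-silent share of the call.
    """
    if not 0.0 <= silence_fraction <= 1.0:
        raise ValueError("silence_fraction must be between 0 and 1")
    return codec_rate_kbps * (1.0 - silence_fraction)


# Assumed 64-kbps voice stream with 50 and 60 percent silence.
for silence in (0.5, 0.6):
    print(silence, average_voice_bandwidth_kbps(64.0, silence))
```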
VAD works by monitoring the power of the voice signal, changes in power, the frequency of the incoming voice signal, and changes in that frequency. VAD's challenge is in correctly identifying when speech stops, and also when it starts again. VAD waits approximately 200 ms after it perceives that speech has stopped before disengaging the packetization process. This pause helps prevent VAD from clipping the trailing portion of speech or engaging in the middle of a small break in the speech pattern. Similarly, a delay of 5 ms is introduced by the CODEC to "hold on" to voice information in the event that speech is detected. This means that when VAD determines that a voice signal is once again present, the previous 5 ms of voice are transmitted along with the current voice signals. This delay reduces, but does not eliminate, front-end clipping, in which the beginning of speech is cut off.
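The hold-over and look-back behavior can be sketched as a small gate over fixed-length frames. This is a simplified illustration, not how any particular DSP implements VAD: it uses a single energy threshold in place of the power and frequency measurements described above, and the function name, the 5-ms frame length, and the threshold parameter are assumptions made for the example.

```python
from collections import deque


def vad_gate(frame_energies, threshold, frame_ms=5, hangover_ms=200, lookback_ms=5):
    """Return the indices of the frames that would be transmitted.

    hangover_ms: how long transmission continues after speech appears to stop.
    lookback_ms: how much held audio is sent when speech is detected again.
    """
    hangover_frames = hangover_ms // frame_ms
    held = deque(maxlen=lookback_ms // frame_ms)  # most recent untransmitted frames
    sent = []
    transmitting = False
    silent_run = 0  # consecutive silent frames seen while transmitting

    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if not transmitting:
                # Speech detected: send the held frames first to limit front-end clipping.
                sent.extend(held)
                held.clear()
                transmitting = True
            silent_run = 0
            sent.append(i)
        elif transmitting:
            silent_run += 1
            if silent_run > hangover_frames:
                # The hold-over period has expired: stop packetizing.
                transmitting = False
                held.append(i)
            else:
                sent.append(i)  # still inside the hold-over period
        else:
            held.append(i)
    return sent


# Two silent frames, a short burst of speech, a long pause, then speech again.
energies = [0, 0] + [5, 5, 5] + [0] * 50 + [5]
sent = vad_gate(energies, threshold=1.0)
print(len(sent), "of", len(energies), "frames transmitted")
```

With the default values, frames during the first 200 ms of the pause are still transmitted, and the single frame held just before each onset of speech is sent along with it.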
Echo
Echo is caused by electrical reflections in the voice network. These reflections are usually the result of an impedance differential between the 4-wire switch connection and the 2-wire local loop. A little bit of echo is always present, and it is actually comforting to the speaker to hear his or her voice echoed back through the handset. However, an echo which is delayed more than 25 ms is distracting and disconcerting for the speaker. Since echo is usually caused at the remote end of the circuit, an increase in network delay beyond 25 ms will require some means of counteracting the echo.
The PSTN handles echo in two ways. One is to lower the power of the signal, thus minimizing the magnitude of the echo. The second is through the use of echo cancellers. Echo cancellers are placed in between the CO switch and the 4-wire-to-2-wire converter connected to the local loop. In packet networks, they are often integrated into the DSP used for packetization.
Echo cancellers operate by combining the echo signal with its exact opposite. Since the echo canceller is in line between the signal origin and the point at which it is reflected, its job is to simply remember the voice patterns that flow through it, wait for them to return as echo, and then apply the inverse of the original voice pattern to the returning echo. Figures 3-15 through 3-19 illustrate this process.
The DSP reserves memory space to record processed signals and store them while waiting for the return echo. The time during which the DSP can wait for the return echo is limited by the size of the memory allocation on the DSP.
Echo cancellation enables networks with larger delays to be built since echo is removed close to its source.
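The store, wait, and subtract cycle can be sketched in a few lines. This is a deliberately simplified model under stated assumptions: the echo is treated as a single attenuated copy of the outgoing signal with a known delay and gain, whereas a practical canceller has to estimate these adaptively. The function name and all parameters are hypothetical.

```python
from collections import deque


def cancel_echo(outgoing, returning, delay_samples, echo_gain, memory_samples):
    """Subtract the expected echo of `outgoing` from `returning`, sample by sample.

    memory_samples models the limited memory the canceller has for storing
    outgoing samples; an echo that returns later than that cannot be cancelled.
    """
    if delay_samples >= memory_samples:
        raise ValueError("echo delay exceeds the canceller's memory")
    history = deque([0.0] * memory_samples, maxlen=memory_samples)
    cleaned = []
    for sent, received in zip(outgoing, returning):
        history.append(sent)  # remember the pattern flowing through
        expected_echo = echo_gain * history[-1 - delay_samples]
        cleaned.append(received - expected_echo)  # adding the inverse = subtracting
    return cleaned


# Assumed example: the echo returns 2 samples later at half amplitude,
# mixed with a constant far-end signal of 0.1.
outgoing = [1.0, -2.0, 3.0, 0.0, 0.0, 0.0]
far_end = [0.1] * 6
echo = [0.0, 0.0] + [0.5 * s for s in outgoing[:4]]
returning = [f + e for f, e in zip(far_end, echo)]
print(cancel_echo(outgoing, returning, delay_samples=2, echo_gain=0.5, memory_samples=4))
```

The `ValueError` branch corresponds to the memory limit noted above: if the echo takes longer to return than the stored history covers, there is nothing left to subtract.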
Figure 3-15
Initial voice signal is transmitted from the switch.
Figure 3-16
Echo canceller stores an inverted sample of the original signal.
Figure 3-17
Figure 3-18
Echo canceller combines echo with inverted sample in memory.
Figure 3-19
Echo-cancelled signal is transmitted back toward the source.
Delay
Delay is important because it directly affects the perceived quality of the phone call. Increased delay leads to talker overlap and echo. Calls with excessive delays are difficult for the participants because they lengthen the amount of time between conversational responses, making it hard to keep a conversation in sync. This creates a situation somewhat analogous to congestion conditions in data networks. The sender's patience in waiting for a response may run out, causing him or her to re-ask the question (retransmit) even though the response may already be on the way back.
Troublesome echo is caused when the end-to-end delay in the network is above 25 to 35 ms. At this point, echo distracts the speaker and begins to degrade the quality of the call. Echo cancellation as described above is an effective means of limiting this problem.
Delay can be divided into two components, propagation delay and handling delay.
1. Propagation delay is the time the electrical signal takes to travel through the copper wire network to deliver the transmitted message. It is very low, since signals propagate at roughly 100,000 miles per second in copper. This translates to approximately 30 ms for a cross-country copper path of 3000 miles. Propagation delay should not be confused with serialization delay. Serialization delay is the time required to clock data onto a link and is based upon the signaling rate of the link. The serialization delay of a 100-mile 64-kbps link is 24 times that of a 100-mile T1, yet the propagation delay is the same for both. In summary, propagation delay is based upon the transmission medium (copper/fiber) and the distance, while serialization delay is based upon the signaling rate of the circuit.
2. Handling delay is introduced by all the components which handle the voice traffic during its transmission. In a voice over IP network the following items add to the one-way delay of an end-to-end voice transmission:
• Digitization of analog voice signal
• Compression of digital voice signal
• Packetization of voice traffic
• Queuing of packetized voice
• Transmission and serialization delay over the initial network link
• Queuing and serialization delays over all intermediate network elements (routers, LAN switches, WAN switches, LAN/WAN links)
• Reception and queuing of packet at destination
• Depacketization of voice traffic
• Decoding of digital voice signal
• Translation of digital signal to analog voice signal
Figure 3-20 Delay timeline.
Of these elements, the queuing and serialization delays are the most significant in WAN configurations. Two subsequent chapters on quality of service are dedicated to the discussion of these issues. The timeline in Figure 3-20 helps identify important delay measurements.
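The distinction between propagation delay and serialization delay drawn earlier in this section can be checked with a short calculation. The sketch below uses the text's figure of 100,000 miles per second for copper. The function names, the 160-byte frame size, and the use of 1.536 Mbps (24 channels of 64 kbps) as the T1 rate are assumptions made for illustration.

```python
def propagation_delay_ms(distance_miles, speed_miles_per_s=100_000):
    """Time for the signal to travel the distance; independent of link rate."""
    return distance_miles * 1000.0 / speed_miles_per_s


def serialization_delay_ms(frame_bytes, link_rate_bps):
    """Time to clock one frame onto the link; independent of distance."""
    return frame_bytes * 8 * 1000.0 / link_rate_bps


FRAME_BYTES = 160        # hypothetical frame size
RATE_64K = 64_000        # 64-kbps link
RATE_T1 = 24 * 64_000    # T1 payload rate assumed as 24 x 64 kbps

print("propagation, 3000 miles:", propagation_delay_ms(3000), "ms")
print("propagation, 100 miles: ", propagation_delay_ms(100), "ms (same for both links)")

slow = serialization_delay_ms(FRAME_BYTES, RATE_64K)
fast = serialization_delay_ms(FRAME_BYTES, RATE_T1)
print("serialization at 64 kbps:", slow, "ms")
print("serialization on T1:     ", fast, "ms")
print("ratio:", slow / fast)
```

The ratio depends only on the two link rates, so it comes out close to 24 for any frame size, while the propagation figure changes only with distance.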
Jitter
Jitter refers to the changing arrival rate of packets from the network due to variations in the transit delay. IP networks do not offer consistent performance and often introduce large variances in the packet arrival rates. This is due to several factors, including queuing delays, variable packet sizes, and the relative load on the intermediary links and routers. To compensate for jitter, voice devices incorporate a playout buffer on the receiving device. The buffer holds on to packets long enough for the slowest packet to arrive in time to be processed in sequence.
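The effect of a playout buffer can be illustrated with a fixed playout delay. In this sketch, which is an assumption-laden simplification, each packet is played a fixed time after it was sent, and a packet whose transit delay exceeds that time arrives too late to be used. The function name and the sample transit delays are hypothetical.

```python
def playout_schedule(send_times_ms, transit_delays_ms, playout_delay_ms):
    """Split packet indices into those that arrive in time and those that are late.

    Each packet is scheduled for playout at send time + playout_delay_ms.
    """
    played, late = [], []
    for i, (sent, transit) in enumerate(zip(send_times_ms, transit_delays_ms)):
        arrival = sent + transit
        deadline = sent + playout_delay_ms
        if arrival <= deadline:
            played.append(i)
        else:
            late.append(i)
    return played, late


send_times = [0, 20, 40, 60]   # one packet every 20 ms (assumed)
transits = [30, 45, 35, 80]    # variable transit delay, i.e. jitter (assumed)

print(playout_schedule(send_times, transits, playout_delay_ms=50))
# Sizing the delay for the slowest packet lets every packet be played,
# at the cost of more end-to-end delay.
print(playout_schedule(send_times, transits, playout_delay_ms=max(transits)))
```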
The idea of buffering packets before processing them is diametrically opposed to the goal of minimizing delay. Unfortunately, it is a necessary evil. The quality of service techniques discussed in Chapters 3 and 4 can be employed to reduce jitter throughout the IP network. Since jitter cannot be eliminated, the jitter buffer must be carefully tuned to provide an optimal packet-delivery rate while minimizing delay.
The process that maintains the playout buffer begins with a minimum and maximum buffer size (measured in ms). During operation, it constantly monitors the arrival rate of packets and dynamically adjusts the playout-buffer size to support the changing network conditions. In low-delay environments, the playout buffer is reduced to the minimum. In environments with highly variable delay, the playout buffer adapts slowly to reduced-delay situations and quickly to increased-delay situations. This ensures that packet loss is minimized by maintaining an adequate buffer size and that absolute delay is controlled by maintaining a maximum queuing delay.
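The asymmetric adaptation described above can be sketched as follows. This is one possible interpretation rather than a description of any specific product: the class name, the `decay` parameter, and the use of the most recently observed delay variation as the target size are all assumptions made for the example.

```python
class AdaptivePlayoutBuffer:
    """Playout-buffer size that grows quickly and shrinks slowly, within fixed bounds."""

    def __init__(self, min_ms, max_ms, decay=0.05):
        if min_ms > max_ms:
            raise ValueError("min_ms must not exceed max_ms")
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.decay = decay          # fraction of the gap closed per update when shrinking
        self.size_ms = float(min_ms)

    def observe(self, delay_variation_ms):
        """Update the buffer size from the latest observed delay variation."""
        # Clamp the target between the configured minimum and maximum.
        target = min(max(delay_variation_ms, self.min_ms), self.max_ms)
        if target > self.size_ms:
            self.size_ms = float(target)  # increased delay: react immediately
        else:
            # Reduced delay: move only part of the way toward the smaller size.
            self.size_ms += self.decay * (target - self.size_ms)
        return self.size_ms


buf = AdaptivePlayoutBuffer(min_ms=20, max_ms=100, decay=0.5)
for variation in (10, 60, 10, 500, 10):
    print(variation, "->", buf.observe(variation))
```

Growing at once protects against packet loss when delay rises, while shrinking gradually avoids discarding packets if the improvement turns out to be brief. The upper bound keeps the buffer from adding unbounded delay.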