7.6 APU Algorithm
7.6.2 Closed network testing
We derive the APU state diagram using a novel closed-network test methodology based on actual human subjective testing. In our tests, human subjects rate their perceived QoE (as a MOS score) of interactive VoIP calls over a range of packet loss and delay values configured using Dummynet [49].
The test cases are all possible combinations of the delay and MOS levels: <A, A>, <A, P>, <A, U>, <P, A>, <P, P>, <P, U>, <U, A>, <U, P> and <U, U>, where each test case is defined by a pair of network factor levels <d_net, MOS_net>. For example, the <A, P> test case corresponds to network conditions that result in an Acceptable one-way delay and a Poor rating factor R for the codec used.
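The enumeration of test cases can be sketched as a Cartesian product of the three levels with themselves. This is a minimal illustration only; the names LEVELS and enumerate_test_cases are hypothetical and do not come from the test tooling used in this work.

```python
from itertools import product

# The three levels used for both factors: Acceptable, Poor, Unacceptable.
LEVELS = ("A", "P", "U")


def enumerate_test_cases(levels=LEVELS):
    """Return every <delay level, MOS level> pair, with the delay level first."""
    return list(product(levels, repeat=2))


test_cases = enumerate_test_cases()

# Print the cases in the same <x, y> notation used in the text.
print(", ".join(f"<{d}, {m}>" for d, m in test_cases))
```

Because product varies the last factor fastest, the cases come out in the same order as they are listed in the text.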
We carried out conversation-opinion subjective tests following the procedures in ITU-T Recommendations P.800 and P.920 [41, 74]. The tests were run on an isolated LAN with no cross traffic. Before the test, the participants were informed about its purpose, procedures and benefits. ITU-T recommends a minimum of 16 persons to reach the accuracy required for the results [74].
In order to obtain a wide range of subjective quality scores, 20 human subjects took part in the test. We classified the subjects as general or experienced users. General users have moderate experience from occasional use of VoIP applications. Experienced users use VoIP applications frequently and know what matters to them when using them. Following ITU-T Recommendation P.920 [74], our tests were based mainly on the Name-Guessing task, a question-answer game performed according to a fixed protocol. A baseline test with no network impairments was executed before the tests started. The subjects were asked to rate the perceived quality of each test case relative to the baseline test. In our experiment, each test case is tested 5 times, and scores are given on the MOS scale from 1 (Bad) to 5 (Excellent), as shown in Figure 7-23. The final MOS score is the arithmetic mean of all the individual scores. To reproduce the tests, we use the midpoint of each delay and MOS range (e.g., delay = 275 ms for the Poor level).
Figure 7-23 MOS scale
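The scoring procedure described above (arithmetic mean of the individual scores, and the midpoint of a level's range as its representative value) can be sketched as follows. The function names are hypothetical, and the 150 to 400 ms band for the Poor delay level is an assumption chosen only because its midpoint matches the 275 ms quoted in the text; the actual band limits are not given in this section.

```python
from statistics import mean


def final_mos(scores):
    """Arithmetic mean of individual opinion scores on the 1 (Bad) to 5 (Excellent) scale."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("scores must lie between 1 and 5")
    return mean(scores)


def midpoint(low, high):
    """Representative value of a level, taken as the midpoint of its range."""
    return (low + high) / 2


# ASSUMPTION: band limits for the Poor delay level, in milliseconds.
# Only the 275 ms midpoint is stated in the text.
POOR_DELAY_RANGE_MS = (150, 400)

poor_delay_ms = midpoint(*POOR_DELAY_RANGE_MS)
```

In use, final_mos would be called once per test case with all scores collected for that case across subjects and repetitions.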
We note that our results from these tests can be generalized to any voice codec (e.g., G.711, G.729, G.723, G.729A). For simplicity, we focus our testing on two commonly used codecs: G.711 and GSM FR. Our results are shown in Figure 7-24, where the test case <d_net, MOS_net> is on the x axis and its corresponding average MOS score is on the y axis.
Interestingly, our tests show that human perception is more sensitive to packet loss than to delay; this can be seen directly from the higher MOS score of the <P, A> test case compared with <A, P>. Contrary to our initial expectations, listeners barely perceived the transition from the Acceptable to the Poor level of delay, whereas they clearly recognized the difference between the Poor and Unacceptable levels of delay. This is shown in Figure 7-25 by the small slope, for both codecs, between the test case pairs <A, A> & <P, A>, <A, P> & <P, P>, and <A, U> & <P, U>, and by the larger slope between <P, A> & <U, A>, <P, P> & <U, P>, and <P, U> & <U, U>. On the other hand, the transitions of the rating factor R (indicating the packet loss level) from Acceptable to Poor and then to Unacceptable are clearly perceived; this is shown by the large slopes between <A, A>, <A, P> & <A, U>; between <P, A>, <P, P> & <P, U>; and between <U, A>, <U, P> & <U, U> in Figure 7-25.
From our tests, we also noticed that people prefer a conversation with both delay and packet loss at the Poor level over one in which either of them is at the Unacceptable level, as shown in Figure 7-26, where the MOS score of the <P, P> test case is greater than those of <A, U>, <P, U>, <U, A> and <U, P>. Since both subjectively tested codecs (G.711 and GSM FR) show the same trend in Figure 7-24, we expect the results to generalize to any voice codec: the MOS of the different test cases should follow the same trend, with different values due to the differing performance of the codecs.
Figure 7-24 Subjective testing MOS scores of G.711 and GSM FR codecs
Figure 7-26 Subjective testing MOS scores (Delay transition effect)
In order to define a single metric for our subjective results, we define the term “APU score” as an indication of the conversational call quality, from the human perception point of view, of each test case, where each test case reflects different network conditions. Table 7-3 follows directly from Figure 7-24. The APU score increases with the perceived VoIP quality.
Table 7-3 APU scores of each state
State   APU score   State   APU score
AA      9           AU      4
PA      8           PU      3
AP      7           UP      2
PP      6           UU      1
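As an illustration, Table 7-3 can be held as a simple lookup from state to APU score. This is a sketch rather than the implementation used in this work: the names APU_SCORE and apu_score are hypothetical, and only the states listed in the table are included.

```python
# Scores copied from Table 7-3. Each key is <delay level><MOS level>,
# e.g. "PA" means Poor delay and Acceptable rating factor R.
APU_SCORE = {
    "AA": 9, "PA": 8, "AP": 7, "PP": 6,
    "AU": 4, "PU": 3, "UP": 2, "UU": 1,
}


def apu_score(delay_level, mos_level):
    """Look up the APU score of a state; states not in the table are not guessed."""
    state = f"{delay_level}{mos_level}"
    if state not in APU_SCORE:
        raise KeyError(f"state {state!r} is not listed in Table 7-3")
    return APU_SCORE[state]


# Consistency checks against the observations reported in the text.
loss_hurts_more = apu_score("P", "A") > apu_score("A", "P")
both_poor_preferred = all(
    apu_score("P", "P") > APU_SCORE[s] for s in ("AU", "PU", "UP")
)
```

Both checks evaluate to True for the tabulated values, which matches the observations that packet loss is perceived more strongly than delay and that <P, P> is preferred over states containing an Unacceptable level.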