6.2 Validation using Psychophysical Experiments
6.2.1 Experimental Framework
In HDR imaging, three methods are commonly used for psychophysical experiments: rating [51, 102, 103, 225, 226, 25, 26], ranking [11, 25, 7] and paired comparisons [102, 103, 108, 9]. These methods require a subject/participant to make simple judgements about the selection, grouping, sorting and ordering of stimuli according to properties such as preference or similarity.
In this study, paired comparisons with a reference image were chosen as the methodology, because rating and ranking have some drawbacks compared to paired comparisons.
Rating has three main problems: range effects, frequency effects, and distribution effects. With range effects, scales can produce different results depending on the subjects' perception of the relationship between the scale and the stimuli. With frequency effects, scales are affected by how many times a subject sees the stimuli, so the subject needs prior training to gain confidence with the scales and stimuli. With distribution effects, an unequal distribution of test stimuli along the range leads to different scale estimates. Therefore, this method needs an accurate definition of the range scale for each task, established through pilot studies, and training of the subjects. Moreover, a large number of participants and stimuli is required to acquire valid and reliable data [93].
Ranking works well when differences are quite noticeable and fine judgement is not needed. Otherwise, many paired comparisons are needed before a ranking can be established. Furthermore, it is not a very accurate method for measuring performance, because there is no mechanism to determine the size of the difference between two images that are adjacent in the ranking. This means that it is not possible to determine whether two different ranks are statistically different or not.
For these reasons, a paired comparison was conducted in which each inverse tone mapped image generated by an iTMO was compared with the images generated by the other iTMOs, using the HDR image as the reference. At any one time, the viewer was presented with three images displayed side by side on the Dolby DR-37P HDR display. The display was calibrated and has a peak luminance of 3,000 cd/m² and a minimum luminance of 0.015 cd/m². The reference was always displayed in the centre and the images produced by two iTMOs were presented on either side, see Figure 6.1. The task of the viewer was to observe the two images and to indicate which one appeared to be most similar to the reference with respect to a specific criterion, which varied across the experiments as explained in Section 6.2.4 and Section 6.2.5.
Figure 6.1: The setup of the experiments. Left: a diagram of the setup of the experiment. Right: an example of a paired comparison using the Dolby DR-37P HDR display. Subjects were required to determine which image (left or right) was most similar to the reference (middle).
The main advantage of data gathered using a paired comparison approach is that the degree of agreement amongst observers in their preferences can be calculated. Since each subject was instructed to assess all possible pairs, the main disadvantage was that a large number of images had to be displayed. On the other hand, the task itself is logical and straightforward for the subject. Although another approach might have been quicker, the limited number of published inverse tone mapping algorithms made a paired comparison the more suitable choice.
      A    B    M    W    R    Score $a_i$
A     -    0    0    0    0    0
B     1    -    0    1    1    3
M     1    1    -    0    0    2
W     0    0    1    -    0    1
R     1    0    1    1    -    3
Table 6.1: An example of the preference matrix $a_{ij}$ for a subject and an image for the comparisons of iTMOs.
For each reference HDR scene, the total number of possible pairs is $t(t-1)/2$, where $t = 5$ is the number of iTMOs being tested. Therefore, each subject had to evaluate 10 pairs in order to assess all of the combinations of the five iTMOs. For any given pair, the subject had to select which image appeared most similar to the reference. The votes of all participants were then added into a single preference matrix.
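The pair count can be checked with a short enumeration. This is only an illustrative sketch: the labels are the row labels of Table 6.1, and the helper name `all_pairs` is hypothetical.

```python
from itertools import combinations

# Labels as they appear in Table 6.1; nothing is assumed about what they stand for.
ITMOS = ["A", "B", "M", "W", "R"]

def all_pairs(items):
    """Return every unordered pair of items, i.e. t(t-1)/2 pairs for t items."""
    return list(combinations(items, 2))

pairs = all_pairs(ITMOS)
print(len(pairs))  # 10
print(pairs[:3])
```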
The data collected in the paired comparisons for each image and participant are recorded in a matrix $a_{ij}$, where the value $a_{ij}$ denotes the number of times $\mathrm{iTMO}_i$ was preferred to $\mathrm{iTMO}_j$. The score of a class or iTMO, $a_i$, is defined as the number of times that it was picked in a comparison with all others:
\[ a_i = \sum_{j=1,\, j \neq i}^{t} a_{ij} \qquad (6.1) \]
For an example of the matrix and scores see Table 6.1.
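As a check of Equation 6.1, the following sketch recomputes the score column of Table 6.1 from its rows. The matrix values are copied from the table; storing the unused diagonal as 0 and the function name `scores` are choices made only for this illustration.

```python
# Preference matrix from Table 6.1 (rows and columns in the order A, B, M, W, R).
# The diagonal is unused and stored as 0.
LABELS = ["A", "B", "M", "W", "R"]
A = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
]

def scores(a):
    """Equation 6.1: a_i is the sum of row i over all j != i."""
    t = len(a)
    return [sum(a[i][j] for j in range(t) if j != i) for i in range(t)]

print(dict(zip(LABELS, scores(A))))  # {'A': 0, 'B': 3, 'M': 2, 'W': 1, 'R': 3}
```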
Consistency and Agreement Coefficients
There are several methods to analyse paired data. One approach would be to use Thurstone's Law of Comparative Judgments [190], which is a measurement model for paired comparisons. Such a model, however, is more appropriate when one can assume that there are perceptual differences between iTMOs. The approach chosen here was instead to analyse the experimental data by primarily testing two "properties": the individual consistency of subjects and the overall agreement amongst them.
Consistency, or transitivity, is an important aspect to consider with paired comparisons. Due to the nature of the experiment, a participant can make inconsistent choices when observing paired data. In the simple case of evaluating three iTMOs, for example, if the participant voted $\mathrm{iTMO}_i$ to be closer to the reference than $\mathrm{iTMO}_j$, and the latter closer than $\mathrm{iTMO}_k$, then one would assume, by logic, $\mathrm{iTMO}_i$ to be better than $\mathrm{iTMO}_k$:
\[ \mathrm{iTMO}_i \rightarrow \mathrm{iTMO}_j, \qquad \mathrm{iTMO}_j \rightarrow \mathrm{iTMO}_k \]
then
\[ \mathrm{iTMO}_i \rightarrow \mathrm{iTMO}_k \]
where $\rightarrow$ is a closeness operator: the operand on the left side is closer to the reference than the one on the right. In a straightforward ranking, this is what would happen. Paired comparisons, on the other hand, allow for the case where $\mathrm{iTMO}_k$ is voted to be better than $\mathrm{iTMO}_i$, thus making an inconsistent choice:
\[ \mathrm{iTMO}_i \rightarrow \mathrm{iTMO}_j \rightarrow \mathrm{iTMO}_k \rightarrow \mathrm{iTMO}_i \]
This situation is called a circular triad. To determine such inconsistencies, the Kendall Coefficient of Consistency can be used [180]. This is defined as follows:
\[ \zeta = 1 - \frac{24c}{t(t^2 - 1)}, \quad \text{where } \zeta \in [0, 1] \qquad (6.2) \]
where $t$ is the number of iTMOs and $c$ is the number of inconsistencies (circular triads) per subject. $\zeta$ is calculated on a per-subject and per-image basis. If $\zeta$ is 1, then there are no inconsistencies and the data could be directly expressed as ranks. $\zeta$ moves towards zero as the number of inconsistencies increases. From the above, it may appear that straightforward ranking or rating would be more appropriate, as it avoids transitivity issues. Nevertheless, asking an observer to rank algorithms from first to last is not always natural behaviour. Measuring inconsistency is also useful because it provides a clear indication of the similarity of the algorithms: if the differences between the iTMOs are small, high inconsistency would be expected, as the task is hard for the observers.
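The computation of $\zeta$ can be sketched as follows, counting circular triads directly from a single subject's preference matrix. The two matrices are hypothetical examples for $t = 5$, not data from the experiments, and the function names are chosen for this illustration. Note that Equation 6.2 as written is the form for an odd number of items, which is the case here.

```python
from itertools import combinations

def circular_triads(a):
    """Count triples (i, j, k) whose three pairwise choices form a cycle."""
    t = len(a)
    count = 0
    for i, j, k in combinations(range(t), 3):
        forward = a[i][j] and a[j][k] and a[k][i]
        backward = a[j][i] and a[k][j] and a[i][k]
        if forward or backward:
            count += 1
    return count

def consistency(a):
    """Equation 6.2; this form applies to an odd number of items such as t = 5."""
    t = len(a)
    return 1 - 24 * circular_triads(a) / (t * (t * t - 1))

# Hypothetical single-subject matrices for t = 5 (not data from the experiments).
# transitive: item i is preferred to every item j > i, so it is a perfect ranking.
transitive = [[1 if i < j else 0 for j in range(5)] for i in range(5)]
# cyclic: item i is preferred only to the next two items around a circle.
cyclic = [[1 if (j - i) % 5 in (1, 2) else 0 for j in range(5)] for i in range(5)]

print(consistency(transitive))  # 1.0
print(consistency(cyclic))      # 0.0
```

The cyclic example contains five circular triads, the maximum possible for five items, which is why $\zeta$ reaches zero.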
The overall agreement amongst all participants in evaluating all iTMOs for each image can be judged via Kendall and Babington-Smith's Coefficient of Agreement [180]. The coefficient is defined as:
\[ u = \frac{2\Sigma}{\binom{s}{2}\binom{t}{2}} - 1, \quad \text{where } u \in [-1, 1] \qquad (6.3) \]
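where $s$ is the number of subjects and $\Sigma$ is the sum of the number of agreements between pairs of observers:
\[ \Sigma = \sum_{i \neq j} \binom{a_{ij}}{2} \qquad (6.4) \]
with $a_{ij}$ here counting the subjects who preferred $\mathrm{iTMO}_i$ to $\mathrm{iTMO}_j$. Such a coefficient is a suitable measure of association or correlation between a set of ranks. $u$ represents the amount of agreement among the participants. It has a maximum value of 1 in the case of complete agreement. Its minimum, reached when the votes on every pair are split as evenly as possible, is $-1/(s-1)$ for even $s$ and $-1/s$ for odd $s$, so the value $-1$ is attained only when $s = 2$.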
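Equations 6.3 and 6.4 can be illustrated with a minimal sketch operating on a pooled preference matrix, in which entry $a_{ij}$ counts the subjects who preferred item $i$ to item $j$. Both matrices below are hypothetical ($s = 4$ subjects, $t = 3$ items) and the function name `agreement` is chosen for this illustration.

```python
from math import comb

def agreement(a, s):
    """Equations 6.3 and 6.4 for a pooled t x t preference matrix from s subjects."""
    t = len(a)
    # Sigma: number of agreeing pairs of subjects, summed over all ordered pairs (i, j).
    sigma = sum(comb(a[i][j], 2) for i in range(t) for j in range(t) if i != j)
    return 2 * sigma / (comb(s, 2) * comb(t, 2)) - 1

# Hypothetical pooled matrices for s = 4 subjects and t = 3 items.
unanimous = [[0, 4, 4], [0, 0, 4], [0, 0, 0]]   # every subject makes the same choices
even_split = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]  # votes split 2 vs 2 on every pair

print(agreement(unanimous, 4))   # 1.0
print(agreement(even_split, 4))
```

For the even split the result is $-1/3$, the lowest value four subjects can produce.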
Tests of Significance
To reinforce the analysis of the data, the judgements of the subjects were further investigated with three statistical tests of significance. The first tests whether there was agreement amongst the participants. If there was, the second tests whether the iTMOs were perceived as different. Finally, if the iTMOs were perceived as significantly different, the third searches for the pairs of iTMOs that were perceived differently. The set of all iTMOs is then divided into groups, such that iTMOs in the same group are not perceived as significantly different, while iTMOs in separate groups are.
The goal of the first test is to determine whether the coefficient of agreement, $u$, is significantly different from the value that would be obtained if the comparisons were made randomly, and thus with no agreement amongst subjects. Formally, this corresponds to testing the null hypothesis $H_0: \mu = 0$ against $H_1: \mu \neq 0$, where $\mu$ is the population coefficient of agreement, i.e. the coefficient that would be obtained if the whole population had rated the iTMOs. For a small number of observers (fewer than 6) and a small number of iTMOs (less than or equal to 8), the critical values of $u$ are tabulated in statistics books, see for example Table U in [180]. For a large number of subjects, $s$, and iTMOs, $t$, on the other hand, the following test statistic is computed:
\[ X^2 = \frac{t(t-1)\,(1 + u(s-1))}{2} \qquad (6.5) \]
which approximately follows a chi-square distribution with $\binom{t}{2}$ degrees of freedom ($df$). The critical values are tabulated in most statistical textbooks, see for example Table C in [180]. If the value of the test statistic $X^2$, Equation 6.5, is greater than or equal to the tabulated critical value at a specific level of significance $\alpha$, the null hypothesis can be rejected, confirming that there is significant agreement amongst observers when judging the various algorithms against the reference. At $\alpha = .05$ and 10 $df$ the critical value is 18.31.
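A minimal sketch of this first test is given below. The inputs ($u = 0.3$, $s = 20$) are hypothetical and chosen only to exercise the formula; the critical value is the one quoted in the text rather than computed.

```python
def agreement_statistic(u, s, t):
    """Equation 6.5 together with its degrees of freedom, t(t-1)/2."""
    df = t * (t - 1) // 2
    x2 = df * (1 + u * (s - 1))
    return x2, df

# Hypothetical inputs: u = 0.3 obtained from s = 20 subjects judging t = 5 iTMOs.
x2, df = agreement_statistic(0.3, 20, 5)

CRITICAL_10_DF = 18.31  # value quoted in the text for alpha = 0.05 and 10 df
reject_h0 = x2 >= CRITICAL_10_DF
print(df, round(x2, 2), reject_h0)
```

With these illustrative numbers the statistic is about 67, well above the critical value, so the null hypothesis of no agreement would be rejected.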
After agreement amongst subjects had been established, the overall test of equality across iTMOs was performed. This test determines whether differences in scores arise purely by chance or whether the differences are significant. Formally, this is equivalent to testing the following null hypothesis:
\[ H_0: \pi_{\mathrm{iTMO}_i} = \frac{1}{2} \quad \forall\, \mathrm{iTMO}_i \qquad (6.6) \]
where $\pi_{\mathrm{iTMO}_i}$ is the unknown probability that $\mathrm{iTMO}_i$ is preferred to another iTMO. The test checks whether there are differences in the scores $a_i$ across the iTMOs. For each image, the standardised sum of squares of the scores, $S$, is computed as:
\[ S = \frac{4}{st} \left( \sum_{i=1}^{t} a_i^2 - \frac{1}{4}\, t s^2 (t-1)^2 \right) \qquad (6.7) \]
and compared with $S_c$, the upper $\alpha$ critical value of the $\chi^2$ distribution with $t-1$ degrees of freedom. If the observed value of $S$ is greater than or equal to the corresponding critical value $S_c$, the null hypothesis $H_0$ is rejected. For example, from $\chi^2$ tables, for $t = 5$ (4 degrees of freedom) at the 0.05 significance level, the critical value is $S_c = 9.49$.
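Equation 6.7 can be sketched as below, with scores pooled over all subjects for one image. The score vectors are hypothetical ($s = 10$ subjects, $t = 5$ iTMOs); each sums to $s \cdot t(t-1)/2 = 100$ votes, as pooled scores from a complete paired comparison must.

```python
def equality_statistic(pooled_scores, s):
    """Equation 6.7 for the pooled scores a_i of t items judged by s subjects."""
    t = len(pooled_scores)
    sum_sq = sum(a * a for a in pooled_scores)
    return 4.0 * (sum_sq - t * s * s * (t - 1) ** 2 / 4.0) / (s * t)

# Hypothetical pooled scores for s = 10 subjects and t = 5 iTMOs.
no_difference = [20, 20, 20, 20, 20]    # every iTMO chosen equally often
clear_ordering = [40, 30, 20, 10, 0]    # strictly ordered preferences

print(equality_statistic(no_difference, 10))   # 0.0
print(equality_statistic(clear_ordering, 10))  # 80.0
```

Equal scores give $S = 0$, so $H_0$ is retained, whereas the strictly ordered scores give a value far above the critical value $S_c$.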
Once good agreement across subjects and significant differences across iTMOs have been established, the pairs of algorithms with significantly different scores are investigated, in particular to see whether one algorithm is consistently preferred over the others. A natural way to do this is to calculate the observed differences in scores between two algorithms and to test whether these differences are due to chance or are statistically significant. Since for $t$ iTMOs there exist $t(t-1)/2$ pairs of iTMOs, $t(t-1)/2$ paired differences need to be calculated, and hence a multiple comparison test is needed. In this case, the critical value, $R$, for determining statistical significance is defined as:
\[ R = \frac{1}{2}\, W_{t,\alpha} \sqrt{st} + \frac{1}{4} \qquad (6.8) \]
where $W_{t,\alpha}$ is the upper $(1-\alpha)$ quantile of the $W_t$ distribution. At a significance level of 0.05 and $t = 5$, $W_{t,\alpha} = 3.86$, see Table 22 in [157]. This procedure is based on the range of the scores obtained by the five iTMOs and it is equivalent to Tukey's multiple comparison correction used in ANOVA [43]. Note that the value of $R$ can be used to generate different groups of iTMOs based on significance.
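The multiple comparison step can be sketched as follows. The pooled scores and the number of subjects are hypothetical, and treating a difference as significant when it is greater than or equal to $R$ is an assumption of this sketch; the value of $W_{t,\alpha}$ is the one quoted above.

```python
from itertools import combinations
from math import sqrt

def critical_range(s, t, w):
    """Equation 6.8: R = 0.5 * W * sqrt(s * t) + 0.25."""
    return 0.5 * w * sqrt(s * t) + 0.25

def significant_pairs(pooled_scores, s, w):
    """Index pairs whose score difference reaches R (using >= is an assumption)."""
    t = len(pooled_scores)
    r = critical_range(s, t, w)
    return [(i, j) for i, j in combinations(range(t), 2)
            if abs(pooled_scores[i] - pooled_scores[j]) >= r]

W_5_005 = 3.86                       # quoted in the text for t = 5 and alpha = 0.05
pooled = [40, 30, 20, 10, 0]         # hypothetical pooled scores, s = 10 subjects

print(round(critical_range(10, 5, W_5_005), 2))
print(significant_pairs(pooled, 10, W_5_005))
```

With these numbers $R$ is about 13.9, so iTMOs whose scores differ by 10 fall into the same group, while those differing by 20 or more are separated.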