Method - Research System: E-meter - Accurate, Fair, and Explainable: Building Human-Centered AI

5.3 Research System: E-meter

5.6.1 Method

We use the same E-meter system as in the previous conditions, but in a more naturalistic deployment. In this study, participants used the E-meter system twice. The E-meter generally provided document-only feedback, but displayed word-level feedback at three points while the user was writing: when the user was one third of the way through the writing task, two thirds of the way through, and almost complete. At each point, the user first answered a very short questionnaire assessing expectation violation, before seeing the word-level transparency. They then viewed the transparency for as long as they wished, during which time they could not continue writing, and pressed a button labeled “Press this button to turn off highlighting and continue writing” to resume the task.

5.6.1.1 Users

We recruited 53 users who had previously passed a short mental health screening (PGWBI) [82] to test the E-meter system. Users were recruited from Amazon Mechanical Turk and paid $3.33. The evaluation took 17 minutes on average. This study was approved by an Institutional Review Board.

5.6.1.2 Measures

The system recorded how long users took to examine the transparency before they continued with the writing task. We refer to this timing as transparency view time. Before viewing the transparency, users were asked how far the current document-level rating was from their own evaluation of their content. We refer to this as the in-the-moment expectation violation; it is measured multiple times per session. After seeing the transparency, users were asked how their understanding changed as a result of viewing it. Finally, after using the system twice, participants were asked about their overall perceptions of the E-meter’s accuracy, their trust in it, and their mental model of the system’s operation.

5.6.2 Results

Confirming the results of studies 1 and 2, we again found that word-level feedback promotes more accurate mental models. In this study, 52 out of 53 participants had accurate mental models of the system: they understood that the system rated individual words as positive or negative and used these ratings to calculate an overall rating. One example of this accurate mental model is given by Participant 53: “It uses key words in the writing and the frequency and ratio of these word to calculate how positive or negative a passage is”. The one remaining participant, who did not have an accurate mental model, felt that the word highlighting was assigned at random and that the meter also moved randomly.

Variable               Coefficient   Std. Error   p-value
Intercept              8.25          0.16         < 0.0001
Expectation Violation  0.07          0.03         0.03

Table 5.2: Effect of Expectation Violation on Transparency View Time

We find that participants who experience greater expectation violation spend more time examining transparency than those who experience less or no expectation violation. This is a within-subjects experiment: we measured expectation violation three times per trial, with two trials per participant, so we analyze the experiment using a linear mixed model with crossed random effects, fit with the R package lme4. The random effects control for intra-participant variance and intra-position variance (essentially order effects), and they are crossed because each participant experiences each position. The fixed effect in the model is the in-the-moment expectation violation, because we expected that users would spend more time examining explanations when their expectations were violated. The dependent variable is the measured time the user spent with the explanation. Before fitting the model, we take the natural log of the time variable because it is exponentially distributed.
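One plausible way to write down the model described above is sketched here. The symbols are our own notation rather than the source's (i indexes participants, j indexes the six transparency positions, t is transparency view time, EV is the in-the-moment expectation violation), and the normality assumptions are the usual ones for a linear mixed model, not something the text states explicitly.

```latex
\log(t_{ij}) = \beta_0 + \beta_1\,\mathrm{EV}_{ij} + u_i + v_j + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here u_i is the participant random intercept and v_j is the position random intercept. The two are crossed rather than nested because every participant contributes an observation at every position.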

Figure 5.6: Transparency View Time By Explanation Position

The model is presented in Table 5.2. We note that the inclusion of each random effect in the model is significant at p < .0001, as determined by likelihood-ratio tests comparing the given model with a comparison model that drops either the participant or the position random effect. This means there is significant variation between participants and also between positions (order effects) within each participant.

Additionally, in-the-moment expectation violation has a significant effect on transparency view time. Because the dependent variable is log-transformed, a one-unit increase in expectation violation is associated with a 7.25% increase in transparency view time. Given that expectation violation ranges from 0 to 3, the maximum expectation violation is associated with a 23% increase in transparency view time. The model overall is very predictive, with R² = .48.
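The percentages above follow from the log transformation of the dependent variable: a coefficient b on the log scale corresponds to a multiplicative change of exp(b) on the original time scale. A minimal sketch of that arithmetic, using the coefficient reported in Table 5.2 (the function and variable names are ours, for illustration only):

```python
import math


def percent_change(coefficient: float, units: float = 1.0) -> float:
    """Percent change in the untransformed outcome for a `units` increase
    in the predictor, when the outcome was natural-log transformed."""
    return (math.exp(coefficient * units) - 1.0) * 100.0


# Expectation violation coefficient from Table 5.2 (log scale).
BETA_EV = 0.07

# One-unit increase in expectation violation.
print(f"{percent_change(BETA_EV):.2f}%")     # 7.25%

# Maximum expectation violation (3 units on the 0 to 3 scale).
print(f"{percent_change(BETA_EV, 3):.1f}%")  # 23.4%
```

Note that the effect at the maximum is exp(3 × 0.07) − 1 rather than 3 × 7.25%, because effects compound multiplicatively on the original scale.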

We were interested in further examining the effects of order on transparency view time. As noted above, including the random effect for position was significant, indicating that there are significant differences in transparency view time based on what part of the task the user was in. In Fig 5.6 we plot the transparency view time according to the position at which the user saw the transparency. The first position, where the user first sees the transparency, greatly exceeds all other view times. The fourth position also has a higher mean than the surrounding positions, though the variance here is very large; recall that the fourth position is the first time the user sees the transparency for the second example of their writing. Transparency view time also decreases over time: users spend less time with the transparency on subsequent viewings within the same example, and view time also decreases between examples.

Figure 5.7: User Improvement in Understanding By Explanation Position

Next we examine how users’ understanding changes over time. Recall that we have an ordinal 7-point Likert item outcome for understanding change, from ‘Strongly Decreased My Understanding’ to ‘Strongly Increased My Understanding’, that users answer each time after they see the transparency. We find that user responses differ significantly across positions using a Kruskal-Wallis rank sum test, χ²(5) = 16.94, p < .01. Visually examining the differences shown in Fig 5.7, the first time users view the transparency there are improvements in understanding for the majority of people. Subsequent viewings of the transparency are more neutral overall: a sizable contingent of users still gain understanding, but many feel their understanding is not changed either way by the transparency. A very small number of users find that their understanding changes negatively when viewing the transparency.
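For readers unfamiliar with the test, the following is a minimal standard-library sketch of the Kruskal-Wallis H statistic (with the usual tie correction, which matters for Likert data) and of the chi-square tail probability used to obtain the p-value. The function names and the toy responses are ours and purely illustrative; they are not the study's data or analysis code.

```python
import math
from collections import Counter


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)

    # Give each distinct value the average of the ranks it occupies (midranks).
    counts = Counter(pooled)
    midrank = {}
    next_rank = 1
    for value in sorted(counts):
        tied = counts[value]
        midrank[value] = next_rank + (tied - 1) / 2
        next_rank += tied

    rank_term = sum(
        sum(midrank[v] for v in group) ** 2 / len(group) for group in groups
    )
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    tie_sum = sum(t ** 3 - t for t in counts.values())
    return h / (1.0 - tie_sum / (n_total ** 3 - n_total))


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y^a * e^(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical 7-point Likert responses for three positions (toy data only).
toy = [[6, 7, 6, 5], [4, 5, 4, 4], [4, 4, 5, 3]]
h = kruskal_wallis_h(toy)
print(f"H = {h:.2f}, p = {chi2_sf(h, len(toy) - 1):.4f}")

# Tail probability for the statistic reported in the text: chi-square(5) = 16.94.
print(f"{chi2_sf(16.94, 5):.4f}")  # 0.0046
```

With six positions the test has 6 − 1 = 5 degrees of freedom, and the tail probability of the reported statistic is consistent with the stated p < .01.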

5.6.3 Discussion

We examined when users want to receive transparency in a naturalistic experiment. Our primary outcome of interest is how long users examine transparency when it is shown to them. We find that users spend significantly more time with transparency when their expectations are violated. Additionally, we see strong effects of order on how much time users spend with transparency. On the first viewing of transparency, users spend significantly more time. The time spent with transparency then decreases over time, as users are likely forming more accurate mental models of the system. There is some indication that showing transparency on new examples (position 4 in Fig 5.6) may lead users to spend more time with the transparency, but further research is needed.
